Abstract. We study the termination problem of the chase algorithm, a 
central tool in various database problems such as the constraint implica- 
tion problem, Conjunctive Query optimization, rewriting queries using 
views, data exchange, and data integration. The basic idea of the chase 
is, given a database instance and a set of constraints as input, to fix con- 
straint violations in the database instance. It is well-known that, for an 
arbitrary set of constraints, the chase does not necessarily terminate (in 
general, it is even undecidable if it does or not). Addressing this issue, 
we review the limitations of existing sufficient termination conditions for 
the chase and develop new techniques that allow us to establish weaker 
sufficient conditions. In particular, we introduce two novel termination 
conditions called safety and inductive restriction, and use them to define 
the so-called T-hierarchy of termination conditions. We then study the 
interrelations of our termination conditions with previous conditions and 
the complexity of checking our conditions. This analysis leads to an algo- 
rithm that checks membership in a level of the T-hierarchy and accounts 
for the complexity of termination conditions. As another contribution, we 
study the problem of data-dependent chase termination and present sufficient
termination conditions w.r.t. fixed instances. They might guarantee
termination although the chase does not terminate in the general case. 
As an application of our techniques beyond those already mentioned, 
we transfer our results into the field of query answering over knowledge 
bases where the chase on the underlying database may not terminate, 
making existing algorithms applicable to broader classes of constraints. 



1 Introduction 

The chase procedure is a fundamental algorithm that has been successfully ap- 
plied in a variety of database applications |8ll3l4ll2ll5l21l2llll9j . Originally pro- 
posed to tackle the implication problem for data dependencies [814] and to opti- 
mize Conjunctive Queries (CQs) under data dependencies [3113] . it has become 
a central tool in Semantic Query Optimization (SQO) [2011122] . For instance, 

* The work of this author was funded by DFG grant GRK 806/03. 



the chase can be used to enumerate minimal CQs under a set of dependen- 
cies pQ, thus supporting the search for more efficient query evaluation plans. 
Beyond SQO, the chase algorithm has been applied in many other contexts, 
such as data exchange [31], peer data exchange 0, data integration [IS], query 
answering using views [12] . and probabilistic databases [19] . 

The core idea of the chase algorithm is simple: given a set of dependencies 
(also called constraints) over a database schema and a fixed database instance 
as input, it fixes constraint violations in the instance. As a minimal and in- 
tuitive scenario we consider a database graph schema that provides a relation 
E(src, dst), which stores directed edges from node src to node dst, and a node 
relation S(n) containing nodes with some distinguished properties, which are en- 
forced by constraints. These constraints will vary from example to example and 
we denote nodes in S as special nodes in the following. We sketch the idea of the 
chase algorithm using a single constraint $\alpha_1 := \forall x (S(x) \rightarrow \exists y\, E(x,y))$, stating
that each special node has at least one outgoing edge. Now consider the sample
database instance $I := \{S(n_1), S(n_2), E(n_1, n_2)\}$. It is easy to see that $I$ does
not satisfy $\alpha_1$, because it does not contain an outgoing edge for special node $n_2$.
In its effort to fix the constraint violations in the database instance, the chase
procedure would create the tuple $t_1 := E(n_2, x_1)$, where $x_1$ is a fresh null value.
The resulting database instance $I' := I \cup \{t_1\}$ now satisfies constraint $\alpha_1$, so the
chase terminates and returns $I'$ as result.
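
The single repair step described above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: atoms are encoded as tuples such as ("S", "n1"), fresh nulls are named x1, x2, ..., and the helper names violations and chase_alpha1 are hypothetical, not part of any chase implementation.

```python
from itertools import count

def violations(inst):
    """Special nodes x with S(x) in inst but no atom E(x, _) in inst."""
    sources = {atom[1] for atom in inst if atom[0] == "E"}
    return sorted(atom[1] for atom in inst
                  if atom[0] == "S" and atom[1] not in sources)

def chase_alpha1(instance):
    """Repeatedly repair one violation of alpha_1 by adding an edge to a fresh null."""
    inst, fresh = set(instance), count(1)
    while True:
        todo = violations(inst)
        if not todo:
            return inst  # alpha_1 is satisfied, the chase terminates
        inst.add(("E", todo[0], f"x{next(fresh)}"))

# The sample instance I from the text (tuple encoding is an assumption).
I = {("S", "n1"), ("S", "n2"), ("E", "n1", "n2")}
I_prime = chase_alpha1(I)
```

On this input the loop performs exactly one step, adding the atom E(n2, x1), which mirrors the instance I' above. The loop always stops for this constraint because it never adds new special nodes.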

One major problem with the chase algorithm, however, is that it does not termi- 
nate in the general case. To give an idea of the problem, let us sketch a scenario 
that induces a non-terminating chase sequence. We replace the constraint $\alpha_1$
from before by constraint $\alpha_2 := \forall x (S(x) \rightarrow \exists y\, E(x,y), S(y))$, which asserts that
each special node links to another special node. Now consider the instance $I$ from
before. Obviously, $I$ does not satisfy $\alpha_2$, because special node $n_2$ has no outgoing
edge. In response, the chase fixes this constraint violation by adding the two
tuples $E(n_2, x_1)$ and $S(x_1)$ to $I$, where $x_1$ is a fresh null value. Constraint $\alpha_2$
is then fixed w.r.t. value $n_2$, but now the special node $x_1$ introduced in the last
chase step violates $\alpha_2$. In subsequent steps the chase would add $E(x_1, x_2)$, $S(x_2)$,
$E(x_2, x_3)$, $S(x_3)$, ..., where $x_2, x_3, \ldots$ are fresh null values. Hence, when given
instance $I$ and constraint $\alpha_2$ as input, the chase procedure will never terminate.
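
To make the cascade tangible, the following Python sketch runs the chase for $\alpha_2$ under a step budget. It is a sketch under our own assumptions: the tuple encoding of atoms, the null names x1, x2, ..., and the function name chase_alpha2 with its max_steps parameter are all hypothetical. Exhausting the budget is of course no proof of non-termination; it merely shows the pattern described in the text.

```python
from itertools import count

def chase_alpha2(instance, max_steps):
    """Chase with alpha_2: S(x) -> exists y (E(x, y) and S(y)), cut off after max_steps."""
    inst, fresh = set(instance), count(1)
    for step in range(max_steps):
        special = {a[1] for a in inst if a[0] == "S"}
        # x satisfies alpha_2 if some edge E(x, y) leads to a special node y
        satisfied = {a[1] for a in inst if a[0] == "E" and a[2] in special}
        todo = sorted(special - satisfied)
        if not todo:
            return inst, step, True   # no violation left: terminated
        y = f"x{next(fresh)}"         # fresh null value
        inst |= {("E", todo[0], y), ("S", y)}
    return inst, max_steps, False     # budget exhausted

I = {("S", "n1"), ("S", "n2"), ("E", "n1", "n2")}
inst, steps, terminated = chase_alpha2(I, 5)
```

With a budget of 5 steps the run ends without terminating, and every step introduces a new special null that itself violates $\alpha_2$.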
As shown in [5], in general it is undecidable if the chase terminates or not, even for 
a fixed instance. Still, addressing the issue of non-terminating chase sequences, 
several sufficient conditions for the input constraints have been proposed that 
guarantee termination on every database instance [21,9,22,18]. The common idea
is to statically assert that there are no positions in the database schema where 
fresh null values might be cyclically created. The term position refers to a
position in a relational predicate, e.g. E(src, dst) has two positions, namely src,
denoted as $E^1$, and dst, denoted as $E^2$. Likewise, we denote by $S^1$ the only
position in predicate S. The non-terminating chase sequence discussed before,
for instance, cyclically creates fresh null values in positions $E^1$ and $S^1$.
One well-known termination condition is weak acyclicity [21]. Roughly speaking, it
implements a global study of the input constraints, to detect cyclically connected



positions in the constraint set that introduce some fresh null values. In [9],
stratification was introduced, which was meant to generalize weak acyclicity, claiming
that it suffices to assert weak acyclicity locally for subsets of constraints that might
cyclically cause each other to fire. We will show that stratification, contrary to the
claim of the authors of [9], does not generally ensure the termination of the chase,
yet, as a central contribution, we can prove that it ensures the termination of at
least one chase sequence. Moreover, we show that this sequence can be statically
determined from the chase graph. It is important to notice that the techniques
introduced in [21,9] take only the constraints into account and not the database
instance. We therefore call such termination conditions data-independent; their 
result is either the guarantee that the chase with these constraints terminates 
for every database instance or that no predictions can be made. 

This paper explores sufficient termination conditions beyond the corrected version
of stratification, which (to the best of our knowledge) is the most general
termination condition known so far. As one major contribution, we study
data-independent chase termination and present conditions that generalize
stratification. Complementarily, we consider the novel problem of data-dependent chase
termination, where our goal is to derive chase termination guarantees w.r.t. a
fixed instance. In the remainder of the Introduction we summarize the key concepts
and ideas of our analysis and survey the main results.

Data-independent chase termination. As discussed before, the source of 
non-terminating chase sequences are fresh null values that are cyclically created 
at runtime in some position(s). We develop new techniques that allow us to stat- 
ically approximate the set of positions where null values are created in or copied 
to during chase application and use them to develop a hierarchy of sufficient 
termination conditions that are strictly more general than stratification. Our 
termination conditions rely on the following ideas. 

(1) Correction and exploration of the stratification condition: We show that strat- 
ification does not generally ensure termination of every chase sequence, as stated 
by the authors of [9], but of at least one chase sequence. Besides, we show that 
such a sequence can be statically determined independently of the input in- 
stance. This opens the door to the area of sufficient termination conditions for 
the chase that ensure, independently of the underlying data, the termination of 
at least one chase sequence and not necessarily of all. Furthermore, we propose a 
possible correction of the stratification condition which ensures the termination
of every chase sequence, as intended by the authors of [9], using the oblivious
chase.

(2) Identification of harmless null values: Often constraints introduce fresh null
values in a certain position, but the (fixed) size of the database instance implies
an upper bound on the number of null values that might be introduced in
this position. Consider for example the constraint
$\alpha_3 := \forall x, y (S(x), E(x,y) \rightarrow \exists z\, E(z,x))$, which may create fresh null values in position $E^1$.
Whenever $\alpha_3$ is part of a constraint set that does not copy null values to or create
null values in position $S^1$, the number of fresh null values that might be introduced
in position $E^1$ by $\alpha_3$ is implicitly fixed by the number of entries in relation S, and
constraint $\alpha_3$ cannot cause an infinite cascading of fresh null values in this position.

(3) Supervision of the flow of null values: We statically approximate the set of
positions where null values might be copied to during chase application, by a
sophisticated study of the interrelations between the individual constraints. Again,
we illustrate the idea by a small and simple example. Let us consider the two
constraints $\beta_1 := \forall x, y (S(x), E(x,y) \rightarrow E(y,x))$ and
$\beta_2 := \forall x, y (S(x), E(x,y) \rightarrow \exists z\, E(y,z), E(z,x))$, which assert that each special
node with an outgoing edge has cycles of length 2 and 3, respectively. We observe
that neither of these constraints inserts fresh null values into relation S, so the
chase will terminate as soon as $\beta_1$ and $\beta_2$ have been fixed for all special nodes
with an outgoing edge, i.e. after a finite number of steps. Somewhat surprisingly,
none of the existing conditions recognizes chase termination for the above scenario.
The reason is that they do not supervise the flow of null values. Our approach
exhibits such an analysis and would guarantee chase termination for the two
constraints above.

(4) Inductive decomposition of the constraint set: The constraint set in the
previous example is not dangerous, because no fresh null values are created
in position $S^1$. Let us, in addition to $\beta_1$ and $\beta_2$, consider the constraint
$\beta_3 := \exists x, y\, S(x), E(x,y)$, stating that there is at least one special node with
an outgoing edge. Clearly, $\beta_3$ fires at most once, so the chase for the constraint
set $\{\beta_1, \beta_2, \beta_3\}$ will still terminate. However, $\beta_3$ complicates the analysis because
it "infects" position $S^1$ in the sense that now null values may be created in this
position. We resolve such situations by an (inductive) decomposition of the
constraint set. When applied to the above example, our approach would recognize
that $\beta_3$ is not cyclically connected with $\beta_1$ and $\beta_2$, and decompose the constraint
set into the subsets $\{\beta_1, \beta_2\}$ and $\{\beta_3\}$, which then are inspected recursively.

Based upon the previous ideas we develop two novel sufficient chase termination
conditions, called safety and inductive restriction. Figure 1 surveys our main
results and relates them to the previous termination condition weak acyclicity
and the corrected version of stratification that we call c-stratification. All classes
in the figure guarantee chase termination in polynomial-time data complexity
and all inclusion relationships are strict. As can be seen, safety generalizes weak
acyclicity and is further generalized by inductive restriction. On top of inductively
restricted constraints we then define an (infinite) hierarchy of sufficient
termination conditions, which we call T-hierarchy. To give an intuition, for a
fixed level in this hierarchy, say T[k], the idea is to study the flow and creation
of fresh null values in detail for chains of up to k constraints that might cause
each other to fire in sequence.

An algorithm. It can be checked in polynomial time if a constraint set is safe; in
contrast, the recognition problem for inductively restricted constraints and the
classes in the T-hierarchy is in coNP. We develop an efficient algorithm that
accounts for the increasing complexity of the recognition problem and can be
used to test membership of a constraint set in some fixed level of the T-hierarchy.
The underlying idea of our algorithm is to combine the different sufficient



Fig. 1. Chase termination conditions. 



termination conditions, to reduce the complexity of checking for termination 
wherever possible. 

Data-dependent chase termination. Whenever the input constraint set does
not fall into some fixed level of the T-hierarchy, no termination guarantees for
the general case can be derived. Arguably, reasonable applications should never
risk non-termination, so the chase cannot be safely applied to any instance in
this case. Tackling this situation, we study the novel problem of data-dependent
chase termination: given a constraint set $\Sigma$ and a fixed instance $I$, does the chase
with $\Sigma$ terminate on $I$? We argue that this setting particularly makes sense in
the context of Semantic Query Optimization, where the query (interpreted as a
database instance) is chased: typically, the query is small, so the "data" part
can be analyzed efficiently (as opposed to the case where the input is a large
database instance). We propose two complementary approaches:

1. Our first, static scheme relies on the observation that, if the instance is 
fixed, we can ignore constraints in the constraint set that will never fire 
when chasing the instance, i.e. if general sufficient termination guarantees 
hold for those constraints that might fire. As a fundamental result, we show 
that in general it is undecidable if a constraint will never fire on a fixed 
instance. Still, we give a sufficient condition that allows us to identify such 
constraints in many cases and derive a sufficient data-dependent condition. 



2. Whenever the static approach fails, our second, dynamic approach comes 
into play: we run the chase and track cyclically created fresh null values in 
a so-called monitor graph. We then fix the maximum depth of cycles in the 
monitor graph and stop the chase when this limit is exceeded: in such a case, 
no termination guarantees can be made. However, we show that each fixed 
search depth implicitly defines a class of constraint-instance pairs for which 
the chase terminates. Intuitively, the search depth limit can be seen as a 
natural condition that allows us to stop the chase when "dangerous" situa- 
tions arise. Under these considerations, our approach adheres to situations 
that are likely to cause non-termination, so it is preferable to blindly running 
the chase and aborting after a fixed amount of time, or a fixed number of 
chase steps. Applications might fix the search depth following a pay-as-you- 
go principle. Ultimately, the combination of our static and dynamic analysis 
constitutes a pragmatic workaround in all scenarios where no general (i.e., 
data-independent) termination guarantees can be made. 



Application. As a possible application of our techniques, we review the prob- 
lem of answering Conjunctive Queries over knowledge bases in the presence of 
constraints, with a focus on scenarios where the chase with the given constraint 
set does not necessarily terminate. This problem was first considered in [13] 
and recently generalized in [5,6]. A key idea in [5] is an overestimation of the
set of positions in which null values might occur, using the concept of so-called 
affected positions. In particular, affected positions are used in [5] to define a 
class of constraints called weakly guarded constraint sets, for which the query 
answering problem is decidable. Using our novel techniques, we refine the no- 
tion of affected positions with the help of a so-called restriction system, which 
is a central tool in our study of data-independent chase termination, e.g. used 
to define the class of inductively restricted constraints and the T-hierarchy. We 
show that restriction systems can be fruitfully applied to generalize the class of 
weakly guarded constraints to a class we call restrictedly guarded constraints, 
thus making the algorithms in [5,6] applicable to a larger class of constraints.

Structure. Section 2 presents the necessary background in databases. Next,
Section 3 provides our results on data-independent chase termination. Its main
results are the exploration/correction of the stratification condition, the
introduction of the T-hierarchy and an algorithm to efficiently test membership of a
constraint set in some level of the T-hierarchy. In Section 4 we then motivate
the novel problem of data-dependent chase termination. As a possible application,
Section 5 demonstrates the applicability of our concepts and methods in
the context of query answering on knowledge bases where the chase may not
terminate. We conclude with some closing remarks in Section 6.

Additional remarks. This paper builds upon the ideas presented in the Extended
Abstract [17]. Other parts of this paper were informally published as
technical reports [22,18].



2 Preliminaries 



General mathematical notation. The natural numbers $\mathbb{N}$ do not include
0. For $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$. For a set $M$, we denote
by $2^M$ its powerset and by $|M|$ its cardinality. Abusing notation we denote
by $|\varphi|$ also the length of a logical formula $\varphi$. Given a tuple $t = (t_1, \ldots, t_n)$ we
define the tuple obtained by projecting on positions $1 \le i_1 < \cdots < i_m \le n$ as

$p_{i_1, \ldots, i_m}(t) := (t_{i_1}, \ldots, t_{i_m})$.

Databases. We fix three pairwise disjoint infinite sets: the set of constants $\Delta$,
the set of labeled nulls $\Delta_{null}$, and the set of variables $V$. Often we will denote
a sequence of variables, constants or labeled nulls by $\overline{a}$ if the length of this
sequence is understood from the context. A database schema $\mathcal{R}$ is a finite
set of relational symbols $\{R_1, \ldots, R_n\}$. To every relational symbol $R \in \mathcal{R}$ we
assign a natural number $ar(R)$ called its arity. A database position is a pair
$(R, i)$ where $R \in \mathcal{R}$ and $i \in [ar(R)]$; for short we write $R^i$, e.g. a three-ary
predicate S has three positions $S^1$, $S^2$, $S^3$. We say that a variable, labeled null,
or constant $c$ appears e.g. in position $R^1$ if there exists an atom $R(c, \ldots)$. In the
rest of the paper, we assume the database schema and the set of constants and
labeled nulls to be fixed and therefore we will suppress them in our notations.
A database instance $I$ is a finite set of $\mathcal{R}$-atoms that contains only elements
from $\Delta \cup \Delta_{null}$ in its positions. The domain of $I$, $dom(I)$, is the set of elements
from $\Delta \cup \Delta_{null}$ that appear in $I$.

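Conjunctive Queries. A Conjunctive Query (CQ) is an expression of the
form $ans(\overline{x}) \leftarrow \varphi(\overline{x}, \overline{z})$, where $\varphi$ is a conjunction of relational atoms, $\overline{x}, \overline{z}$ are
sequences of variables and constants, and it holds that every variable in $\overline{x}$ also
occurs in $\varphi$. If $\overline{x}$ is empty we call the query boolean. The semantics of such
a query $q$ on database instance $I$ is defined as $q(I) := \{ \overline{a} \in \Delta^{|\overline{x}|} \mid I \models \exists \overline{z}\, \varphi(\overline{a}, \overline{z}) \}$.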

Constraints. Let $\overline{x}, \overline{y}$ be sequences of variables. We consider two types of
database constraints: tuple generating dependencies (TGDs) and equality generating
dependencies (EGDs). A TGD is a first-order sentence
$\alpha := \forall \overline{x} (\phi(\overline{x}) \rightarrow \exists \overline{y}\, \psi(\overline{x}, \overline{y}))$
such that (a) both $\phi$ and $\psi$ are conjunctions of atomic formulas (possibly
with parameters from $\Delta$), (b) $\psi$ is not empty, (c) $\phi$ is possibly empty, (d)
both $\phi$ and $\psi$ do not contain equality atoms and (e) all variables from $\overline{x}$ that
occur in $\psi$ must also occur in $\phi$. We denote by $pos(\alpha)$ the set of positions in $\phi$.
An EGD is a first-order sentence $\alpha := \forall \overline{x} (\phi(\overline{x}) \rightarrow x_i = x_j)$, where $x_i, x_j$ occur
in $\phi$ and $\phi$ is a non-empty conjunction of equality-free $\mathcal{R}$-atoms (possibly with
parameters from $\Delta$). We denote the set of positions in $\phi$ by $pos(\alpha)$.
From now on we will use the word constraint instead of saying that a logical
expression may be a TGD or an EGD. Satisfaction of constraints by databases
is defined in the standard first-order manner and is therefore omitted here.
We write $I \models \alpha$ if a constraint $\alpha$ is satisfied by $I$ and $I \not\models \alpha$ otherwise. As
a notational convenience, we will often omit the $\forall$-quantifier and respective
list of universally quantified variables. For a set of TGDs and EGDs $\Sigma$ we
set $pos(\Sigma) := \bigcup_{\xi \in \Sigma} pos(\xi)$. We use the term $body(\alpha)$ for a constraint $\alpha$ as
the set of atoms in its premise; analogously we define $head(\alpha)$. In case $\alpha$ is
a constraint and $\overline{a}$ is a sequence of labeled nulls and constants, then $\alpha(\overline{a})$ is
the constraint $\alpha$ without universal quantifiers but with parameters $\overline{a}$. We will
often abuse this notation and say that a labeled null occurs in $\alpha(\overline{a})$, meaning
that a labeled null is the parameter for some universally quantified variable in $\alpha$.

Homomorphisms. A homomorphism from a set of atoms $A_1$ to a set of
atoms $A_2$ is a mapping $\mu : \Delta \cup V \rightarrow \Delta \cup \Delta_{null}$ such that the following
conditions hold: (i) if $c \in \Delta$, then $\mu(c) = c$ and (ii) if $R(c_1, \ldots, c_n) \in A_1$, then
$R(\mu(c_1), \ldots, \mu(c_n)) \in A_2$.

Chase. Let $\Sigma$ be a set of TGDs and EGDs and $I$ an instance, represented
as a set of atoms. We say that a TGD $\forall \overline{x} \varphi \in \Sigma$ is applicable to $I$ if there
is a homomorphism $\mu$ from $body(\forall \overline{x} \varphi)$ to $I$ and $\mu$ cannot be extended to a
homomorphism $\mu' \supseteq \mu$ from $head(\forall \overline{x} \varphi)$ to $I$. In such a case the chase step

$I \xrightarrow{\forall \overline{x} \varphi,\ \mu(\overline{x})} J$ is defined as follows. We define a homomorphism $\nu$ as follows:
(a) $\nu$ agrees with $\mu$ on all universally quantified variables in $\varphi$, (b) for every
existentially quantified variable $y$ in $\forall \overline{x} \varphi$ we choose a "fresh" labeled null
$n_y \in \Delta_{null}$ and define $\nu(y) := n_y$. We set $J$ to be $I \cup \nu(head(\forall \overline{x} \varphi))$. We say that an
EGD $\forall \overline{x} \varphi \in \Sigma$ is applicable to $I$ if there is a homomorphism $\mu$ from $body(\forall \overline{x} \varphi)$
to $I$ and it holds that $\mu(x_i) \neq \mu(x_j)$. In such a case the chase step
$I \xrightarrow{\forall \overline{x} \varphi,\ \mu(\overline{x})} J$ is defined as follows. We set $J$ to be

• $I$ except that all occurrences of $\mu(x_j)$ are substituted by $\mu(x_i) =: a$, if $\mu(x_j)$
is a labeled null,

• $I$ except that all occurrences of $\mu(x_i)$ are substituted by $\mu(x_j) =: a$, if $\mu(x_i)$
is a labeled null,

• undefined, if both $\mu(x_j)$ and $\mu(x_i)$ are constants. In this case we say that the
chase fails.

A chase sequence is an exhaustive application of applicable constraints
$I_0 \xrightarrow{\varphi_0, \overline{a}_0} I_1 \xrightarrow{\varphi_1, \overline{a}_1} \ldots$, where we impose no strict order on which constraint must be
applied in case several constraints apply. If this sequence is finite, say $I_r$ being
its final element, the chase terminates and its result $I_0^{\Sigma}$ is defined as $I_r$. The
length of this chase sequence is $r$. Note that different orders of application of
applicable constraints may lead to a different chase result. However, as proven
in [21], two different chase orders lead to homomorphically equivalent results, if
these exist. Therefore, we write $I^{\Sigma}$ for the result of the chase on an instance $I$
under constraints $\Sigma$. It has been shown in [814113] that $I^{\Sigma} \models \Sigma$. In case that
a chase step cannot be performed (e.g., because a homomorphism would have
to equate two constants) the chase result is undefined. If we have an infinite
chase sequence $I_0 \xrightarrow{\varphi_0, \overline{a}_0} I_1 \xrightarrow{\varphi_1, \overline{a}_1} \ldots$, we distinguish two cases: (i) if the constraint
set contains an EGD, then we also say that the result is undefined; (ii) if the
constraint set consists of TGDs only then $I^{\Sigma} := \bigcup_{i \ge 0} I_i$ is the union of all



Schema: S(n), E(src, dst)
Constraint Set: $\Sigma := \{\alpha\}$, where

$\alpha$: If $x_2$ is a special node and has some
predecessor $x_1$, then $x_1$ has itself a predecessor:
$S(x_2), E(x_1, x_2) \rightarrow \exists y\, E(y, x_1)$



Fig. 2. A sample constraint. 



intermediate database instances during the application of the chase. 

Oblivious Chase. We will also use oblivious chase steps throughout this paper.
An oblivious chase step for a TGD $\forall \overline{x} \varphi$ is defined as follows. The oblivious step
applies to an instance $I$ if there is a homomorphism $\mu$ from $body(\forall \overline{x} \varphi)$ to $I$.

In such a case the oblivious chase step $I \xrightarrow{*,\ \forall \overline{x} \varphi,\ \mu(\overline{x})} J$ is defined as follows. We
define a homomorphism $\nu$ as follows: (a) $\nu$ agrees with $\mu$ on all universally
quantified variables in $\varphi$, (b) for every existentially quantified variable $y$ in $\forall \overline{x} \varphi$
we choose a "fresh" labeled null $n_y \in \Delta_{null}$ and define $\nu(y) := n_y$. We set $J$ to
be $I \cup \nu(head(\forall \overline{x} \varphi))$. An oblivious chase step for an EGD is a chase step for
an EGD except that we also add an * on the arrow (like in the case of TGDs)
that indicates the step. Intuitively, an oblivious chase step always applies when
the body of a constraint can be mapped to an instance, even if the constraint
is satisfied.
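
The difference between the two applicability notions can be illustrated on the constraint $\alpha_1$ (each special node has an outgoing edge) and the sample instance from the Introduction. The Python sketch below is only an illustration under our own assumptions: the tuple encoding of atoms and the function names standard_triggers and oblivious_triggers are hypothetical.

```python
def standard_triggers(inst):
    """Body matches of S(x) -> exists y E(x, y) that cannot be extended to the head."""
    sources = {a[1] for a in inst if a[0] == "E"}
    return sorted(a[1] for a in inst if a[0] == "S" and a[1] not in sources)

def oblivious_triggers(inst):
    """All body matches of S(x); the head is never inspected."""
    return sorted(a[1] for a in inst if a[0] == "S")

I = {("S", "n1"), ("S", "n2"), ("E", "n1", "n2")}
standard = standard_triggers(I)     # only n2 lacks an outgoing edge
oblivious = oblivious_triggers(I)   # n1 triggers as well, although E(n1, n2) exists
```

Here the standard step applies only to n2, while the oblivious step also applies to n1 although the constraint is already satisfied for it.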



3 Data-independent Termination 

In this section we discuss the sufficient data-independent chase termination
conditions presented in Figure 1. First, we will review existing approaches and then
introduce the novel class of safe constraints, which strictly generalizes weak
acyclicity, but is different from stratification. Building upon the definition of
safety, we then introduce inductively restricted constraints as a consistent
advancement of our ideas. The latter class strictly subsumes all termination
conditions known so far. Finally, we will define a hierarchy of sufficient termination
conditions on top of inductively restricted constraints, the so-called T-hierarchy.
Each level T[k] in this hierarchy is strictly contained in the next level T[k + 1].
Our novel sufficient termination conditions vastly extend the applicability of the
chase algorithm, as they guarantee chase termination for much larger classes of
constraints than previous conditions.

As a minimalistic motivating example for our study of novel chase termination
conditions let us consider the constraint set $\Sigma$ from Figure 2, which is set
in our graph database schema from the Introduction. As we shall see later, the
chase with $\Sigma$ terminates for every database instance. Still, none of the existing
termination conditions is able to recognize termination for this constraint set,
i.e. $\Sigma$ is neither weakly acyclic nor stratified. With the techniques and tools that
we develop within this section, we will be able to guarantee chase termination
for $\Sigma$ on every database instance.

3.1 Weak Acyclicity 

The notion of weak acyclicity from |10I21] is the starting point for our discussion. 
Informally speaking, the key idea of weak acyclicity is to statically estimate the
flow of data between the database positions during the execution of the chase. 
Weak acyclicity asserts that no fresh values are created over and over again. 

Definition 1. (see [H]) The dependency graph $dep(\Sigma)$ of a set of constraints $\Sigma$
is the directed graph defined as follows. The set of vertices is the set of positions
that occur in some TGD in $\Sigma$. There are two kinds of edges. Add them as follows:
for every TGD $\forall \overline{x} (\phi(\overline{x}) \rightarrow \exists \overline{y}\, \psi(\overline{x}, \overline{y})) \in \Sigma$ and for every $x$ in $\overline{x}$ that occurs in $\psi$
and every occurrence of $x$ in $\phi$ in position $\pi_1$

• for every occurrence of $x$ in $\psi$ in position $\pi_2$, add an edge $\pi_1 \rightarrow \pi_2$ (if it does
not already exist).

• for every existentially quantified variable $y$ and for every occurrence of $y$ in
a position $\pi_2$, add a special edge $\pi_1 \xrightarrow{*} \pi_2$ (if it does not already exist).

A set $\Sigma$ of TGDs and EGDs is called weakly acyclic iff $dep(\Sigma)$ has no cycles
going through a special edge. □

Intuitively, normal edges in the dependency graph track the flow of data between 
the database positions and special edges cover the case of newly introduced null 
values. If the dependency graph contains no cycles through a special edge it can- 
not happen that fresh null values are cyclically added to the database instance. 
It has been shown in [21] that weak acyclicity can be decided in polynomial time. 
We illustrate the definition of weak acyclicity by example. 

Example 1. We depict the dependency graph for the constraint set $\Sigma :=
\{\alpha_1, \alpha_2, \alpha_3\}$ from Figure [5| in Figure 3. One can observe that $\Sigma$ is not weakly
acyclic, as witnessed by the self-loop through special edge $fly^2 \xrightarrow{*} fly^2$. □
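
The construction in Definition 1 is mechanical, so a small Python sketch may help to see it at work. It is a sketch under our own assumptions: a TGD is encoded as a pair (body, head) of atom lists, an atom as (relation, tuple of variable names), constants and EGDs are not handled, and the function names dependency_graph and weakly_acyclic are hypothetical.

```python
def dependency_graph(tgds):
    """Return (normal_edges, special_edges) over positions (relation, index)."""
    normal, special = set(), set()
    for body, head in tgds:
        body_vars = {v for _, args in body for v in args}
        head_vars = {v for _, args in head for v in args}
        # positions of existentially quantified variables in the head
        exist_pos = {(r, j) for r, args in head
                     for j, v in enumerate(args, 1) if v not in body_vars}
        for r, args in body:
            for i, x in enumerate(args, 1):
                if x not in head_vars:
                    continue  # only universal variables that occur in the head
                for r2, args2 in head:
                    for j, v in enumerate(args2, 1):
                        if v == x:
                            normal.add(((r, i), (r2, j)))
                special |= {((r, i), p) for p in exist_pos}
    return normal, special

def weakly_acyclic(tgds):
    """True iff no cycle of the dependency graph goes through a special edge."""
    normal, special = dependency_graph(tgds)
    succ = {}
    for u, v in normal | special:
        succ.setdefault(u, set()).add(v)

    def reaches(src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        return False

    # a special edge u -> v lies on a cycle iff v reaches u
    return not any(reaches(v, u) for u, v in special)

# Constraints from the Introduction and from Figure 2 (encoding is an assumption).
alpha1 = ([("S", ("x",))], [("E", ("x", "y"))])
alpha2 = ([("S", ("x",))], [("E", ("x", "y")), ("S", ("y",))])
fig2 = ([("S", ("x2",)), ("E", ("x1", "x2"))], [("E", ("y", "x1"))])
```

Under this encoding, {alpha1} is weakly acyclic, whereas {alpha2} has a special self-loop on $S^1$ and the constraint of Figure 2 has a special self-loop on $E^1$, so neither of the latter two is weakly acyclic.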

3.2 Stratification 

In [9], stratification was introduced, which was meant to improve the former weak
acyclicity condition. The main idea behind stratification is to decompose the
constraint set into independent subsets that are then separately tested for weak
acyclicity. More precisely, the decomposition splits the input constraint set into
subsets of constraints that may cyclically cause each other to fire. The idea is
that the termination guarantee for the full constraint set should follow if weak 
acyclicity holds for each subset in the decomposition. 

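Definition 2. (see [5]) Given two TGDs or EGDs $\alpha, \beta \in \Sigma$ we define $\alpha \prec \beta$ iff
there exists a relational database instance $I$ and $\overline{a}, \overline{b}$ such that (i) $I \not\models \alpha(\overline{a})$, (ii)
$I \models \beta(\overline{b})$, (iii) $I \xrightarrow{\alpha, \overline{a}} J$, and (iv) $J \not\models \beta(\overline{b})$. □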



Fig. 3. Dependency graph for $\Sigma$ from Example 1 (nodes: $hasAirport$, $fly^1$, $fly^2$, $fly^3$, $rail^1$, $rail^2$, $rail^3$).



Intuitively, $\alpha \prec \beta$ means that if $\alpha$ fires it can cause $\beta$ to fire (in the case that $\beta$
could not fire before). We give an example to illustrate this definition.

Example 2. (see [9]) Let predicate E store the edge relation of a graph and let
the constraint $\alpha := E(x_1, x_2), E(x_2, x_1) \rightarrow \exists y_1, y_2\, E(x_1, y_1), E(y_1, y_2), E(y_2, x_1)$
be given, stating that each node having a cycle of length 2 also has a cycle of
length 3. A 3-cycle can never be a 2-cycle again, so it holds that $\alpha \not\prec \alpha$. □

The actual definition of stratification then relies, as outlined before, on the notion 
of weak acyclicity. 

Definition 3. (see [9]) The chase graph $G(\Sigma) = (\Sigma, E)$ of a set of constraints
$\Sigma$ contains a directed edge $(\alpha, \beta)$ between two constraints iff $\alpha \prec \beta$. We call $\Sigma$
stratified iff the constraints in every cycle of $G(\Sigma)$ are weakly acyclic. □

Stratification strictly generalizes weak acyclicity (see [9]), thus (i) if $\Sigma$ is weakly
acyclic, then it is also stratified and (ii) there are constraint sets that are
stratified but not weakly acyclic (cf. Example 3).

Example 3. Consider the constraint $\alpha$ from Example 2. It holds that $\alpha \not\prec \alpha$, so
$\{\alpha\}$ is stratified. As shown in [9], the dependency graph of $\{\alpha\}$ contains a cycle
through a special edge, so $\{\alpha\}$ is not weakly acyclic. □

It can be decided in coNP whether a set of constraints is stratified. The authors 
of [9] claimed the following result: 

Claim. [9] Let $\Sigma$ be a fixed set of stratified constraints. Then, there exists a
polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$, the length of every
chase sequence is bounded by $Q(|dom(I)|)$. □

Unfortunately, this claim is wrong, as the next example shows.

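Example 4. Given the set of TGDs $\Sigma = \{\alpha_1, \ldots, \alpha_4\}$, where

$\alpha_1 := R(x_1) \rightarrow S(x_1, x_1)$,

$\alpha_2 := S(x_1, x_2) \rightarrow \exists z\, T(x_2, z)$,

$\alpha_3 := S(x_1, x_2) \rightarrow T(x_1, x_2), T(x_2, x_1)$ and

$\alpha_4 := T(x_1, x_2), T(x_1, x_3), T(x_3, x_1) \rightarrow R(x_2)$.

We now give an instance for which the chase does not necessarily terminate.
Consider the database $\{R(a)\}$ and the chase sequence which applies the
constraints in the order $\alpha_1, \alpha_2, \alpha_3, \alpha_4, \ldots$ and so on. The first steps of the
resulting chase sequence look as follows:

$\{R(a)\} \xrightarrow{\alpha_1} \{R(a), S(a,a)\} \xrightarrow{\alpha_2} \{R(a), S(a,a), T(a,n_1)\} \xrightarrow{\alpha_3} \{R(a), S(a,a), T(a,n_1), T(a,a)\}$
$\xrightarrow{\alpha_4} \{R(a), S(a,a), T(a,n_1), T(a,a), R(n_1)\} \xrightarrow{\alpha_1} \{\ldots, S(n_1,n_1)\} \xrightarrow{\alpha_2} \{\ldots, T(n_1,n_2)\} \xrightarrow{\alpha_3} \ldots$

where $n_1, n_2$ are fresh null values. It can be easily seen that this sequence is
infinite. The chase graph for $\Sigma$ is depicted in Figure 4. The only cycle in it is
constituted by full TGDs only and therefore is weakly acyclic. Hence, it follows
that $\Sigma$ is stratified. □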
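
The effect of the application order in Example 4 can be explored with a small standard-chase simulation in Python. This is a sketch under our own assumptions, not the authors' implementation: TGDs are encoded as (body, head) pairs of atoms over variable names, the strategy "always fire the first applicable TGD in a given priority list" and the max_steps budget are our choices, and all function names are hypothetical. Exhausting the budget only suggests, but does not prove, non-termination.

```python
from itertools import count

# The four TGDs of Example 4; z is the only existentially quantified variable.
TGDS = {
    1: ([("R", ("x1",))], [("S", ("x1", "x1"))]),
    2: ([("S", ("x1", "x2"))], [("T", ("x2", "z"))]),
    3: ([("S", ("x1", "x2"))], [("T", ("x1", "x2")), ("T", ("x2", "x1"))]),
    4: ([("T", ("x1", "x2")), ("T", ("x1", "x3")), ("T", ("x3", "x1"))],
        [("R", ("x2",))]),
}

def matches(atoms, facts, h):
    """Yield all extensions of assignment h mapping every atom to some fact."""
    if not atoms:
        yield h
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for frel, fargs in facts:
        if frel != rel or len(fargs) != len(args):
            continue
        g = dict(h)
        if all(g.setdefault(v, c) == c for v, c in zip(args, fargs)):
            yield from matches(rest, facts, g)

def applicable(tgd, inst):
    """A body match that cannot be extended to the head, or None."""
    body, head = tgd
    facts = sorted(inst)  # fixed iteration order keeps the run deterministic
    for h in matches(body, facts, {}):
        if next(matches(head, facts, h), None) is None:
            return h
    return None

def chase(inst, order, max_steps):
    """Standard chase; returns (instance, steps) or (instance, None) on budget exhaustion."""
    inst, fresh = set(inst), count(1)
    for step in range(max_steps):
        for k in order:
            h = applicable(TGDS[k], inst)
            if h is not None:
                break
        else:
            return inst, step  # no TGD applicable: the chase terminated
        g = dict(h)
        for rel, args in TGDS[k][1]:
            for v in args:
                if v not in g:
                    g[v] = f"n{next(fresh)}"  # fresh labeled null
        inst |= {(rel, tuple(g[v] for v in args)) for rel, args in TGDS[k][1]}
    return inst, None

start = {("R", ("a",))}
print(chase(start, (1, 3, 2, 4), 20)[1])  # prefers alpha_3 over alpha_2
print(chase(start, (1, 2, 3, 4), 20)[1])  # prefers alpha_2 over alpha_3
```

Preferring $\alpha_3$ over $\alpha_2$ terminates after two steps with $\{R(a), S(a,a), T(a,a)\}$, whereas preferring $\alpha_2$ keeps introducing fresh nulls until the budget is exhausted. This matches the observation that only some chase sequences terminate for this stratified constraint set.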




Fig. 4. Chase graph for Example 4

This has profound implications. Unlike weak acyclicity, stratification does not
ensure termination of every chase sequence for every instance. However, with the
definition of stratification as in [9] we can prove another, equally useful result. If
a set of constraints is stratified, we cannot ensure termination for every instance
and every chase sequence, but for every instance there is some chase sequence
that terminates, as stated in the following theorem. We want to emphasize that
this result is our own finding.

Theorem 1. Let $\Sigma$ be a fixed set of stratified constraints. Then, there exists a
polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$ there is a terminating
chase sequence whose length is bounded by $Q(|dom(I)|)$.



Proof: For the termination see the proof of theorem [5J The polynomial data 
complexity follows immediately from the polynomial data complexity of weakly 
acyclic constraint sets. □ 
But how can we use this result in practice? The first idea is to apply the chase
in a breadth-first manner, i.e. to generate a tree whose root is the start instance
and in which the children of a node are obtained by applying one chase step to
the instance at that node; the tree is expanded in breadth-first manner. This
ensures that if there is a terminating chase sequence, then we will find it.
Unfortunately, this is rather ineffective because in an intermediate instance many
constraints may be violated, and therefore the degree and the depth of the tree
may be high.
As it turns out, we are in a much better situation here. We can use the chase 
graph to effectively construct the order in which the constraints must be applied 
to ensure termination. To the best of our knowledge, this is the first study of 
sufficient termination conditions for the chase which does not ensure the termi- 
nation of all chase sequences but at least of some sequence. Additionally, the 
sequence can be effectively constructed. We give an example to illustrate this. 

Example 5. Consider the constraint set $\Sigma$ from Example 4 again and the
instance $\{R(a), T(b,b)\}$. We give a chase sequence that terminates.

{R(a),T(b,b)} 
{R(a),T(b, b),S(a,a)} 
{R(a),T(b,b),S(a,a),T(a,a)} 
{R(a), T(b, b),S(a, a),T(a, a), R(b)} 
{R(a), T(b, b), S(a, a), T(a, a), R(b), S(b, b)}

It holds that $\{R(a), T(b,b), S(a,a), T(a,a), R(b), S(b,b)\} \models \Sigma$. We obtained a
terminating chase sequence by first chasing with the constraints in the cycle and,
after the chase with these constraints is finished, (possibly) chasing with $\alpha_2$,
which was not necessary here. It can be shown that this strategy always leads
to finite chase sequences, regardless of the underlying instance. Intuitively, this
works here because violations of $\alpha_2$ can be repaired with the help of $\alpha_3$. □
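
The contrast between Examples 4 and 5 can be replayed with a small chase simulator. The following is only a sketch under stated assumptions: the encoding of atoms as tuples, the "?" prefix for variables, and the helper names (`unify`, `matches`, `chase_step`, `chase`) are illustrative choices and not part of the original development. The scheduler always fires the first violated TGD of a priority list; on these inputs the list $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ reproduces the looping order of Example 4, while $\alpha_1, \alpha_3, \alpha_4, \alpha_2$ chases the cycle first and $\alpha_2$ last, as in Example 5.

```python
from itertools import count

# Terms starting with "?" are variables; all other strings are constants or nulls.
# A TGD is a pair (body_atoms, head_atoms); an atom is a tuple ("R", t1, ..., tn).
A1 = ([("R", "?x1")], [("S", "?x1", "?x1")])
A2 = ([("S", "?x1", "?x2")], [("T", "?x2", "?z")])          # ?z is existential
A3 = ([("S", "?x1", "?x2")], [("T", "?x1", "?x2"), ("T", "?x2", "?x1")])
A4 = ([("T", "?x1", "?x2"), ("T", "?x1", "?x3"), ("T", "?x3", "?x1")],
      [("R", "?x2")])


def unify(args, values, env):
    """Extend env so that args map onto values; return None on a clash."""
    env = dict(env)
    for arg, value in zip(args, values):
        if arg.startswith("?"):
            if env.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return env


def matches(atoms, facts, env):
    """Yield every extension of env that maps all atoms into facts."""
    if not atoms:
        yield env
        return
    (pred, *args), rest = atoms[0], atoms[1:]
    for fact in facts:
        if fact[0] == pred and len(fact) == len(args) + 1:
            extended = unify(args, fact[1:], env)
            if extended is not None:
                yield from matches(rest, facts, extended)


def chase_step(facts, tgd, fresh):
    """Repair one violation of tgd in place; return False if there is none."""
    body, head = tgd
    snapshot = sorted(facts)                      # deterministic trigger order
    for env in matches(body, snapshot, {}):
        if next(matches(head, snapshot, env), None) is None:
            env = dict(env)
            for atom in head:                     # name the existential variables
                for term in atom[1:]:
                    if term.startswith("?") and term not in env:
                        env[term] = f"n{next(fresh)}"
            facts.update((a[0], *(env.get(t, t) for t in a[1:])) for a in head)
            return True
    return False


def chase(instance, priority, max_steps=50):
    """Always fire the first violated TGD in priority; stop after max_steps."""
    facts, fresh = set(instance), count(1)
    for steps in range(max_steps):
        if not any(chase_step(facts, tgd, fresh) for tgd in priority):
            return facts, steps, True             # every constraint is satisfied
    return facts, max_steps, False                # step budget exhausted


looping, n_loop, done_loop = chase({("R", "a")}, [A1, A2, A3, A4], max_steps=12)
finite, n_fin, done_fin = chase({("R", "a"), ("T", "b", "b")}, [A1, A3, A4, A2])
print(done_loop, n_loop, done_fin, n_fin)
```

The step budget is only a device for observing the first order without looping forever; it proves nothing about non-termination, which follows from the argument in Example 4.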

The observations made in this example can be generalized to the next theorem. 

Theorem 2. Let $\Sigma$ be a fixed set of stratified constraints. If the chase
terminates independently of the database instance and independently of the chase
order for every strongly-connected component of the chase graph $G(\Sigma)$, then
for every database instance $I$ a terminating chase sequence can be effectively
constructed.

Proof: Let the chase graph $G(\Sigma) = (\Sigma, E)$ be given. We write $\alpha \sim \beta$ if and
only if $\alpha$ and $\beta$ are contained in a common cycle in $G(\Sigma)$ or $\alpha = \beta$. Note that
$\sim$ is an equivalence relation. Let $\Sigma/{\sim} = \{W_1, \ldots, W_n\}$ and $E' := \{ (W_i, W_j)
\mid i, j \in [n], i \neq j$, there is some $\alpha_i \in W_i, \beta_j \in W_j$ such that $\alpha_i \prec \beta_j \}$. Let
$W'_1, \ldots, W'_n$ be a topological sorting of $(\Sigma/{\sim}, E')$. Note that $W'_1, \ldots, W'_n$ are the
strongly connected components of the chase graph and constraint sets that are
not involved in any cycle in the chase graph, therefore the chase terminates
independently of the database instance and independently of the chase order
for these constraint sets. Let $I_0$ be an arbitrary database instance. Let $I_i$ be
obtained from $I_{i-1}$ by chasing with $W'_i$ for every $i \in [n]$. It holds that $I_2 \models W'_1$.
Otherwise there is some $\alpha \in W'_1$, $\beta \in W'_2$ and a database instance $I$ such that
$I \models \alpha$, but $I \xrightarrow{\beta} J$ and $J \not\models \alpha$. But this implies $\beta \prec \alpha$, which means $W'_2$ must come
before $W'_1$ in the topological sorting of $(\Sigma/{\sim}, E')$. Using induction on $n$ it can
be seen that $I_n \models \Sigma$ (observe that $W'_1, \ldots, W'_n$ is a partition of $\Sigma$). □
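
The construction in the proof amounts to computing the strongly connected components of the chase graph and chasing them in a topological order. A minimal sketch of that scheduling step follows. It assumes the chase graph is already available as an edge list (computing $\prec$ itself is not shown); the function name and the edge list for Example 4, which is read off from the discussion of that example rather than taken from Figure 4, are assumptions made for illustration.

```python
def sccs_in_topological_order(nodes, edges):
    """Tarjan's algorithm; components are returned so that edges point forward."""
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)

    index, low, on_stack, stack, found = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            found.append(component)

    for v in nodes:
        if v not in index:
            visit(v)
    return found[::-1]    # Tarjan emits components in reverse topological order


# Assumed chase graph of Example 4: a1 -> a2, a1 -> a3, a3 -> a4, a4 -> a1.
schedule = sccs_in_topological_order(
    ["a1", "a2", "a3", "a4"],
    [("a1", "a2"), ("a1", "a3"), ("a3", "a4"), ("a4", "a1")],
)
print(schedule)
```

Chasing each returned component to completion before moving on to the next one gives the order used in Example 5: first the cycle $\{\alpha_1, \alpha_3, \alpha_4\}$, then $\alpha_2$.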
This allows us to apply the chase procedure safely in situations when the termi- 
nation cannot be guaranteed for every chase sequence. We avoid the overhead 
of branching in the breadth-first chase and therefore reduce the complexity of 
generating a chase result. This opens the door to the area of sufficient
termination conditions for the chase which ensure, independently of the underlying
data, the termination of at least one chase sequence and not necessarily of all.
Furthermore, in the next section, we propose a possible correction of the strati- 
fication condition which ensures the termination for every chase sequence, using 
the oblivious chase. 



3.3 C-Stratification 

We propose a possible correction of stratification, called c-stratification, which
has the property of ensuring the termination of the chase independent of the 
data and the chase sequence used. The underlying ideas are the same as in [3]: 
decompose the constraint set such that if a termination guarantee can be made 
for every subset, then the same guarantee can be made for the overall set. 

Definition 4. (see [pj) Given two TGDs or EGDs a, [3 £ £ we define a ^ c (3 
iff there exists a relational database instance I and a, b such that (i) I ^= a(a), 

(ii) I f= /9(F), (iii) I J, and (iv) J ^ (3(b). a 

Note that in this definition we use an oblivious chase step and not a standard 
chase step. We give an example to illustrate this definition. 

Example 6. (see also before) Let predicate $E$ store the edge relation of a graph
and let the constraint $\alpha := E(x_1, x_2), E(x_2, x_1) \rightarrow
\exists y_1, y_2\, E(x_1, y_1), E(y_1, y_2), E(y_2, x_1)$ be given, stating that each node having a
cycle of length 2 also has a cycle of length 3. A 3-cycle can never be a 2-cycle
again, so it holds that $\alpha \not\prec_c \alpha$. □

The actual definition of c-stratification then relies, as outlined before, on the 
notion of weak acyclicity. 

Definition 5. The c-chase graph $G_c(\Sigma) = (\Sigma, E)$ of a set of constraints $\Sigma$
contains a directed edge $(\alpha, \beta)$ between two constraints iff $\alpha \prec_c \beta$. We call $\Sigma$
c-stratified iff the constraints in every cycle of $G_c(\Sigma)$ are weakly acyclic. □



Example 7. Consider the constraint set from Example 4 again. The problem there
was that $\alpha_2$, the only TGD containing existential quantifiers, had no successor
in the $\prec$-relation. However, in the c-chase graph $\alpha_2$ has a successor and indeed
there is a cycle through $\alpha_2$, as witnessed by Figure 5. The only strongly connected
component is $\Sigma$ itself, which is not weakly acyclic. So, $\Sigma$ is not c-stratified, as
witnessed by the non-terminating chase sequence in Example 4.




Fig. 5. Chase graph for Example 7

From the definition of $\prec_c$ it is not immediately clear that it is decidable; however,
the test for membership in $\prec_c$ can be done with linear-sized databases.

Proposition 1. It can be decided in coNP whether a set of constraints is c- 
stratified. □ 

Proof Sketch. We start with an additional claim: let $\alpha, \beta$ be constraints. Then,
the mapping $(\alpha, \beta) \mapsto \alpha \prec_c \beta\,?$ can be computed by an NP-algorithm. The proof
of this claim proceeds like the proof of Theorem 3 in [9]. It is enough to consider
candidate databases for $I$ of size at most $|\alpha| + |\beta|$, i.e. unions of homomorphic
images of the premises of $\alpha$ and $\beta$ s.t. null values occur only in positions from $P$.
Because of this claim, the c-chase graph of a set of constraints can be computed
by an NP-algorithm. To prove that $\Sigma$ is not c-stratified, guess some strongly
connected component $\Sigma'$ of the c-chase graph and verify that it is not weakly
acyclic. □

And indeed c-stratification does ensure the termination of the chase independent 
of the data and the chase sequence as the following theorem states. 

Theorem 3. Let $\Sigma$ be a fixed set of c-stratified constraints. Then, there exists
a polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$, the length of
every chase sequence is bounded by $Q(|dom(I)|)$. □

Proof Sketch. Let $\Sigma$ be the set of constraints under consideration. Let
$SC_1, \ldots, SC_n$ be the strongly connected components of $G_c(\Sigma)$. We will show
the following lemma.

Lemma 1. If the chase terminates data-independently for every strongly
connected component of $G_c(\Sigma)$, then it terminates data-independently for $\Sigma$. □



Proof Sketch. Assume that we have a database instance io such that the chase 
does not terminate. We will construct an infinite chase sequence that uses only 
constraints from some of the C^. 

We have an infinite chase sequence S — Iq t ^- L -> h -^-+ ■ ■ •• Without loss of 
generality, we can assume that every constraint from S fires infinitely often and 
that for every j 6 N there is some i > j such that \= 0^(5^), where I' n := I a , 

Ii-i ai ^ 1 Ji for I ^ j and Jj :— Jj-%. Let k £ N arbitrary. 

Consider the infinite set f := { i £ No | Io \= 
ai(a,i), (ai,cii) appears on an arrow in S }. For i £ W that is large enough, we 
find a sequence Si of constraints of length i > U > 2 such that 

— s i — 0i,l, fli,h, 

— for every there is some j such that (3^ = otj, 

— there is an instance Jo such that | Jq\ < Ylje[h] \b°dy(Pi,j)\ an d Jo /3 *^~4 A 

J\ P'-li-i'- ... w here every pair (f3i > i,ai ! i) appears on an arrow 

in S, 

— for every j £ [Zj — 1] it holds that J/.^ h= (^i,;*)) where Jq := Jo, 

J[_ x J[ for U > I ^ j and Jj := Jj_ 1 and 

— is maximal with these properties. 

For every i £ W large enough there is an infinite sequence of such se- 
quences of constraints (s^^-gN such that s,. is a subsequence of Sj. +1 
and s l0 = s». Consider C 4 := UjeNu{0} s h and C i,danger := { /3 £ C; 

forall A; £ N there is some s.; . such that (3 appears at least k times in Si . 
}. We find some SC V {%' £ [n]) and C[ := { /3 M , /3 Wfc £ SC* | 
A,; £ C\ yd anger,h e W } such that C- C SCi> n Ci !dQ „g er . With the help 
of (si we can show that there must be an infinite chase sequence for C[. 
This concludes the proof of the lemma. □ 

The proof of the theorem follows from the application of the previous lemma. 
The polynomial time data complexity follows from the fact that c-chase graphs 
bound the depth of null values (see [IT]) data-independently. Note that by 
assumption, chasing with $C_i$ terminates in time $Q_i(|dom(I)|)$. □



3.4 Safety 

The basic idea of our new termination condition, safety, is to estimate the set 
of positions where labeled nulls may be copied to and (statically) analyze the 
data flow only between those positions. As a useful tool, we borrow the notion of 
so-called affected positions from [5] , which is an overestimation of the positions 
in which a null value that was introduced during the chase may occur. 

Definition 6. Let $\Sigma$ be a set of TGDs. The set of affected positions $\text{aff}(\Sigma)$
of $\Sigma$ is defined inductively as follows. Let $\pi$ be a position in the head of an
$\alpha \in \Sigma$.

• If an existentially quantified variable appears in $\pi$, then $\pi \in \text{aff}(\Sigma)$.

• If the same universally quantified variable $X$ appears both in position $\pi$ and
only in affected positions in the body of $\alpha$, then $\pi \in \text{aff}(\Sigma)$. □

Fig. 6. Left: Dependency graph. Right: Corresponding propagation graph (it
has no edges).

Although we borrow this definition from [5], our focus is different. We use affected
positions to extend known classes of constraints for which the chase terminates,
whereas [5] investigates query answering in cases where the chase may not
terminate. Neither work subsumes the other.

We motivate the safety termination condition using the single constraint $\beta :=
R(x_1, x_2, x_3), S(x_2) \rightarrow \exists y\, R(x_2, y, x_1)$. The dependency graph of constraint set
$\{\beta\}$ is shown in Figure 6 (left). As can be seen, there is a cycle going through a
special edge, so the set is not weakly acyclic. We next study the affected positions
in $\beta$:

Example 8. Let us consider the constraint set $\Sigma := \{\beta\}$. Clearly, position $R^2$ is
affected because it contains an existentially quantified variable. $S^1$ is not affected
because $S$ is not modified when chasing with the single constraint $\beta$. Finally, we
observe that $R^1$ is not affected either, because $x_2$ occurs not only in $R^2$ but also in
$S^1$, which is not an affected position. We conclude that position $R^2$ is the only
affected position in constraint set $\Sigma$. □

We now argue that for constraint $\beta$ a cascading of fresh labeled nulls cannot
occur, i.e. no fresh labeled null can repeatedly create new labeled nulls in
position $R^2$ while copying itself to position $R^1$. The reason is that $\beta$ cannot be
violated with a fresh labeled null in $R^2$, i.e. if $R(a_1, a_2, a_3)$ and $S(a_2)$ hold, but
$\exists y\, R(a_2, y, a_1)$ does not, then $a_2$ is never a newly created labeled null. This is due
to the fact that $a_2$ also occurs in $S^1$, but $S^1$ is not an affected position. Hence,
the chase sequence always terminates. We will later see that this is not a mere
coincidence: the constraint is safe.

Like in the case of weak acyclicity, we define the safety condition with the help of 
the absence of cycles containing special edges in some graph, called propagation 
graph. 

Definition 7. Given a set of TGDs $\Sigma$, the propagation graph $\text{prop}(\Sigma) :=
(\text{aff}(\Sigma), E)$ is the directed graph defined as follows. There are two kinds of edges
in $E$. Add them as follows: for every TGD $\forall \bar{x}(\varphi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x}, \bar{y})) \in \Sigma$ and for
every $x$ in $\bar{x}$ that occurs in $\psi$ and every occurrence of $x$ in $\varphi$ in position $\pi_1$

• if $x$ occurs only in affected positions in $\varphi$ then, for every occurrence of $x$ in
$\psi$ in position $\pi_2$, add an edge $\pi_1 \rightarrow \pi_2$ (if it does not already exist).

• if $x$ occurs only in affected positions in $\varphi$ then, for every existentially
quantified variable $y$ and for every occurrence of $y$ in a position $\pi_2$, add a special
edge $\pi_1 \rightarrow \pi_2$ (if it does not already exist). □

As an improvement over weak acyclicity, in the propagation graph we do not 
supervise the whole data flow but only the flow of labeled nulls that might 
be introduced at runtime. Consequently, the graph contains edges only for null 
values that stem exclusively from affected positions. We now can easily define 
the safety condition on top of the propagation graph. 

Definition 8. A set $\Sigma$ of constraints is called safe iff $\text{prop}(\Sigma)$ has no cycles
going through a special edge. □

Example 9. Consider the constraint $\beta$ from Example 8. Its dependency graph
is depicted in Figure 6 on the left side and its propagation graph on the right
side. The latter contains only the affected position $R^2$ (and no edges). From
the definitions of weak acyclicity and safety it follows that $\beta$ is safe, but not
weakly acyclic. □
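
The claims of Examples 8 and 9 can be checked mechanically. The sketch below computes affected positions (Definition 6), the propagation graph (Definition 7) and the safety test (Definition 8). It makes several assumptions that are not part of the text: TGDs are encoded as pairs of atom lists, every term is a variable, a position is a pair (relation, index), and the dependency graph is built following the standard definition of weak acyclicity, which this section uses but does not restate. All function names are illustrative.

```python
def positions_of(var, atoms):
    """All positions (relation, index) at which var occurs in atoms."""
    return {(a[0], i) for a in atoms for i, t in enumerate(a[1:], 1) if t == var}


def variables(atoms):
    return {t for a in atoms for t in a[1:]}


def affected(tgds):
    """Least fixpoint of Definition 6 (all terms are assumed to be variables)."""
    aff, changed = set(), True
    while changed:
        changed = False
        for body, head in tgds:
            for x in variables(head):
                existential = x not in variables(body)
                if existential or positions_of(x, body) <= aff:
                    new = positions_of(x, head) - aff
                    aff |= new
                    changed = changed or bool(new)
    return aff


def graph(tgds, only_from=None):
    """Edges (source, target, is_special).

    only_from=None yields the dependency graph; only_from=affected(tgds)
    yields the propagation graph of Definition 7.
    """
    edges = set()
    for body, head in tgds:
        universal = variables(body)
        existential_positions = set()
        for y in variables(head) - universal:
            existential_positions |= positions_of(y, head)
        for x in universal & variables(head):
            sources = positions_of(x, body)
            if only_from is not None and not sources <= only_from:
                continue          # x also occurs in a non-affected position
            for src in sources:
                edges |= {(src, dst, False) for dst in positions_of(x, head)}
                edges |= {(src, dst, True) for dst in existential_positions}
    return edges


def has_special_cycle(edges):
    """True iff some special edge (u, v) lies on a cycle, i.e. v reaches u."""
    succ = {}
    for u, v, _ in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(start):
        seen, todo = {start}, [start]
        while todo:
            for w in succ.get(todo.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    todo.append(w)
        return seen

    return any(special and u in reachable(v) for u, v, special in edges)


def weakly_acyclic(tgds):
    return not has_special_cycle(graph(tgds))


def safe(tgds):
    return not has_special_cycle(graph(tgds, affected(tgds)))


# beta := R(x1, x2, x3), S(x2) -> exists y R(x2, y, x1)
beta = ([("R", "x1", "x2", "x3"), ("S", "x2")], [("R", "x2", "y", "x1")])
print(affected([beta]), weakly_acyclic([beta]), safe([beta]))
```

For $\beta$ the only affected position is $R^2$, the propagation graph has no edges, and the dependency graph has a special edge on a cycle, which matches Example 9.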

The intuition of safety is that we forbid an unrestricted cascading of null values, 
i.e. with the help of the propagation graph we impose a partial order on the 
affected positions such that any newly introduced null value can only be created 
in a position that has a higher rank in that partial order in comparison to null 
values that may occur in the body of a TGD. To state this more precisely, assume
that a TGD of the form $\forall \bar{x}(\varphi(\bar{x}) \rightarrow \exists \bar{y}\, \psi(\bar{x}, \bar{y}))$ is violated. Then, $I \models \varphi(\bar{a})$ and
$I \not\models \exists \bar{y}\, \psi(\bar{a}, \bar{y})$ must hold. The safety condition ensures that any position in the
body that contains a newly created labeled null from $\bar{a}$ and occurs in the head of
the TGD has a strictly lower rank in our partial order than any position in which
some element from $\bar{y}$ occurs. The main difference compared to weak acyclicity
is that, in safety, we look in a refined way (cf. affected positions) on positions 
where labeled nulls can be propagated to. 

It is easy to see that, if a constraint set $\Sigma$ is safe, then every subset of $\Sigma$ is safe,
too. Furthermore, we note that, given a set of constraints, it can be decided in
polynomial time if it is safe or not. In the following theorem we relate safety to the
previous termination conditions weak acyclicity and stratification. In particular,
the theorem clarifies the observation from Example 9, where we could observe
that the propagation graph is a subgraph of the dependency graph. This is not
a mere coincidence:

Theorem 4. Let $\Sigma$ be a set of constraints.

• The graph $\text{prop}(\Sigma)$ is a subgraph of $\text{dep}(\Sigma)$.

• If $\Sigma$ is weakly acyclic, then it is also safe.

• There is some $\Sigma$ that is safe, but not c-stratified and vice versa. □



Proof Sketch. (a) The set of vertices of $\text{prop}(\Sigma)$ is contained in the set of
vertices of $\text{dep}(\Sigma)$. In order to add an edge to $\text{prop}(\Sigma)$ stronger prerequisites
must be fulfilled than in the construction of $\text{dep}(\Sigma)$. Therefore $\text{prop}(\Sigma)$ is a
subgraph of $\text{dep}(\Sigma)$. (b) If $\text{dep}(\Sigma)$ does not have a cycle through a special edge,
then $\text{prop}(\Sigma)$ cannot have one either. (c) Let $\alpha := S(x_2, x_3), R(x_1, x_2, x_3) \rightarrow \exists y\, R(x_2, y, x_1)$
and $\beta := R(x_1, x_2, x_3) \rightarrow S(x_1, x_3)$. It can be seen that $\alpha \prec \beta$ and $\beta \prec \alpha$.
Together with the fact that $\{\alpha, \beta\}$ is not weakly acyclic it follows that $\{\alpha, \beta\}$ is
not stratified. However, $\{\alpha, \beta\}$ is safe. Let $\gamma := T(x_1, x_2), T(x_2, x_1) \rightarrow \exists y_1, y_2\,
T(x_1, y_1), T(y_1, y_2), T(y_2, x_1)$ (see [5]). It was argued in [3] that $\{\gamma\}$ is stratified.
However, it is not safe because both $T^1$ and $T^2$ are affected. Therefore we have
that $\text{dep}(\{\gamma\}) = \text{prop}(\{\gamma\})$, and it was argued in [9] that it is not weakly acyclic.
□

Like stratification and weak acyclicity, safety guarantees the termination of the 
chase in polynomial time data complexity, i.e. the set of constraints is fixed and 
the number of chase steps is polynomial in the number of distinct values in the 
input database instance: 

Theorem 5. Let $\Sigma$ be a fixed set of safe constraints. Then, there exists a
polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$, the length of every
chase sequence is bounded by $Q(|dom(I)|)$. □

Proof Sketch. First we introduce some additional notation. We denote
constraints in the form $\varphi(\bar{x}_1, \bar{x}_2, \bar{u}) \rightarrow \exists \bar{y}\, \psi(\bar{x}_1, \bar{x}_2, \bar{y})$, where $\bar{x}_1, \bar{x}_2, \bar{u}$ are all the
universally quantified variables and

• $\bar{u}$ are those variables that do not occur in the head,

• every element in $\bar{x}_1$ occurs in a non-affected position in the body, and

• every element in $\bar{x}_2$ occurs only in affected positions in the body.

The proof is inspired by the proof of Theorem 3.8 in [2T], especially the notation 
and some introductory definitions are taken from there. In a first step we will 
give the proof for TGDs only, i.e. we do not consider EGDs. Later, we will see 
what changes when we add EGDs again. 

Note that $\Sigma$ is fixed. Let $(V, E)$ be the propagation graph $\text{prop}(\Sigma)$. For every
position $\pi \in V$ an incoming path is a, possibly infinite, path ending in $\pi$. We
denote by $rank(\pi)$ the maximum number of special edges over all incoming
paths. It holds that $rank(\pi) < \infty$ because $\text{prop}(\Sigma)$ contains no cycles through
a special edge. Define $r := \max\{ rank(\pi) \mid \pi \in V \}$ and $p := |V|$. It is easily
verified that $r \leq p$, thus $r$ is bounded by a constant. This allows us to partition
the positions into sets $N_0, \ldots, N_p$ such that $N_i$ contains exactly those positions
$\pi$ with $rank(\pi) = i$. Let $n$ be the number of values in $I$. We define $dom(\Sigma)$ as
the set of constants in $\Sigma$.
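
The rank function and the partition $N_0, \ldots, N_p$ can be computed by a simple relaxation, sketched below. The edge encoding (source, target, is_special), the function names and the four-vertex example graph are hypothetical; the sketch relies only on the fact stated above that no cycle passes through a special edge, so that every maximal count is attained on a path without repeated vertices.

```python
def ranks(vertices, edges):
    """rank(v) = maximum number of special edges on any path ending in v.

    edges is a collection of (source, target, is_special) triples.
    """
    edges = list(edges)
    rank = {v: 0 for v in vertices}
    # Bellman-Ford style relaxation: without a cycle through a special edge,
    # the values stabilise after at most len(rank) rounds.
    for _ in range(len(rank) + 1):
        changed = False
        for u, v, special in edges:
            candidate = rank[u] + (1 if special else 0)
            if candidate > rank[v]:
                rank[v] = candidate
                changed = True
        if not changed:
            return rank
    raise ValueError("a cycle runs through a special edge, so ranks are unbounded")


def rank_classes(rank):
    """Partition the positions into N_0, N_1, ... by rank."""
    classes = [set() for _ in range(max(rank.values(), default=-1) + 1)]
    for v, r in rank.items():
        classes[r].add(v)
    return classes


# Hypothetical propagation graph: A =special=> B <-> C =special=> D
example_edges = [("A", "B", True), ("B", "C", False), ("C", "B", False), ("C", "D", True)]
example_rank = ranks("ABCD", example_edges)
print(example_rank, rank_classes(example_rank))
```

The `ValueError` branch corresponds to an unsafe constraint set, for which the rank is not finite and the induction of the proof does not apply.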

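Choose some $\alpha := \varphi(\bar{x}_1, \bar{x}_2, \bar{u}) \rightarrow \exists \bar{y}\, \psi(\bar{x}_1, \bar{x}_2, \bar{y}) \in \Sigma$. Let $I \rightarrow^{*} G \xrightarrow{\alpha,\, \bar{a}_1 \bar{a}_2 \bar{b}}
G'$ and let $\bar{c}$ be the newly created null values in the step from $G$ to $G'$. Then

1. newly introduced labeled nulls occur only in affected positions,

2. $\bar{a}_1 \subseteq dom(I) \cup dom(\Sigma)$ and

3. for every labeled null $Y \in \bar{a}_2$ that occurs in $\pi$ in $\varphi$ and every $c \in \bar{c}$ that
occurs in $\rho$ in $\psi$ it holds that $rank(\pi) < rank(\rho)$.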

This intermediate claim is easily proved by induction on the length of the chase
sequence. Now we show by induction on $i$ that the number of values that can
occur in any position in $N_i$ in $G'$ is bounded by some polynomial $Q_i$ in $n$ that
depends only on $i$ (and, of course, $\Sigma$). As $i \leq r \leq p$, this implies the theorem's
statement because the maximal arity $ar(\mathcal{R})$ of a relation is fixed. We denote by
$body(\Sigma)$ the number of characters of the largest body of all constraints in $\Sigma$.

Case 1: $i = 0$. We claim that $Q_0(n) := n + |\Sigma| \cdot n^{ar(\mathcal{R}) \cdot body(\Sigma)}$ is sufficient for our
needs. We consider a position $\pi \in N_0$ and an arbitrary TGD from $\Sigma$ such that
$\pi$ occurs in the head of $\alpha$. For simplicity we assume that it has the syntactic
form of $\alpha$. In case that there is a universally quantified variable in $\pi$, there
can occur at most $n$ distinct elements in $\pi$. Therefore, we assume that some
existentially quantified variable occurs in $\pi$ in $\psi$. Note that as $i = 0$ it must
hold that $|\bar{x}_2| = 0$. Every value in $I$ can occur in $\pi$. But how many labeled
nulls can be newly created in $\pi$? For every choice of $\bar{a}_1 \subseteq dom(G)$ such that
$G \models \varphi(\bar{a}_1, \lambda, \bar{b})$ and $G \not\models \exists \bar{y}\, \psi(\bar{a}_1, \lambda, \bar{y})$ at most one labeled null can be added to
$\pi$ by $\alpha$. Note that in this case it holds that $\bar{a}_1 \subseteq dom(I)$ due to (1). So, there are
at most $n^{ar(\mathcal{R}) \cdot body(\Sigma)}$ such choices. Over all TGDs at most $|\Sigma| \cdot n^{ar(\mathcal{R}) \cdot body(\Sigma)}$
labeled nulls are created in $\pi$.

Case 2: $i \rightarrow i + 1$. We claim that $Q_{i+1}(n) := \sum_{j=0}^{i} Q_j(n) + |\Sigma| \cdot
\left(\sum_{j=0}^{i} Q_j(n)\right)^{ar(\mathcal{R}) \cdot body(\Sigma)}$ is such a polynomial. Consider the fixed TGD
$\alpha$. Let $\pi \in N_{i+1}$. Values in $\pi$ may be either copied from a position in
$N_0 \cup \ldots \cup N_i$ or may be a new labeled null. Therefore w.l.o.g. we assume
that some existentially quantified variable occurs in $\pi$ in $\psi$. In case a TGD,
say $\alpha$, is violated in $G'$ there must exist $\bar{a}_1, \bar{a}_2 \subseteq dom_{G'}(N_0, \ldots, N_i)$ and
$\bar{b} \subseteq dom(G')$ such that $G' \models \varphi(\bar{a}_1, \bar{a}_2, \bar{b})$, but $G' \not\models \exists \bar{y}\, \psi(\bar{a}_1, \bar{a}_2, \bar{y})$. If a newly
introduced labeled null occurs in $\bar{a}_2$, say in some position $\rho$, then $\rho \in \bigcup_{j=0}^{i} N_j$.
As there are at most $\left(\sum_{j=0}^{i} Q_j(n)\right)^{ar(\mathcal{R}) \cdot body(\Sigma)}$ many such choices for $\bar{a}_1, \bar{a}_2$,
at most $\left(\sum_{j=0}^{i} Q_j(n)\right)^{ar(\mathcal{R}) \cdot body(\Sigma)}$ many labeled nulls can be newly created in $\pi$.

When we allow EGDs among our constraints, the number of values that can
occur in any position in $N_i$ in $G'$ can be bounded by the same polynomial $Q_i$,
because equating labeled nulls does not increase the number of labeled nulls
and because EGDs preserve valid existential conclusions of TGDs. □

3.5 Inductive Restriction 

In this section we generalize the method that lifts weak acyclicity to stratifi- 
cation from [3] with the help of so-called restriction systems. The chase graph 
from [9] will be a special case of such a restriction system. With the help of 
restriction systems we then define a new sufficient termination condition called 



inductive restriction, whose main idea is to decompose a given constraint set 
into smaller subsets (in a more refined way than stratification). We then use 
the safety condition from before to check the termination of every subset and, 
whenever all subsets are safe, the termination for the full constraint set can be 
guaranteed. Ultimately, we show that inductive restriction (like all the classes 
discussed before) guarantees chase termination in polynomial-time data
complexity. This section also lays the foundations for the T-hierarchy (cf. Figure [T]),
which will be defined subsequently in Section 3.6. We motivate our study with
a constraint set that is neither safe nor stratified.

Example 10. Let predicate $E(x, y)$ store graph edges and predicate $S(x)$ store
some nodes. The constraint set $\Sigma = \{\alpha_1, \alpha_2\}$ with $\alpha_1 := S(x), E(x, y) \rightarrow
E(y, x)$ and $\alpha_2 := S(x), E(x, y) \rightarrow \exists z\, E(y, z), E(z, x)$ asserts that all nodes in
$S$ have cycles of length 2 and 3, respectively. It holds that $\text{aff}(\Sigma) = \{E^1, E^2\}$
and it is easy to verify that $\Sigma$ is neither safe nor stratified. In particular, we
observe that $\alpha_1 \prec \alpha_2$ and $\alpha_2 \prec \alpha_1$. □

The first task in our formalization is a refinement of relation $\prec$ from [9]. This
refinement will help us to detect whether null values might be copied to the
head of some constraint during the chase. To simplify the definition, we
introduce the notion of null-pos:

Definition 9. Let $\Sigma$ be a set of constraints, $I$ be a fixed database instance
and $A \subseteq \Delta_{null}$. Then, we define $\text{null-pos}(A, I)$ as $\{\pi \in pos(\Sigma) \mid a \in
A$, $a$ occurs in position $\pi$ in $I\}$. □

Informally, $\text{null-pos}(A, I)$ is the set of positions in $I$ in which the elements
(i.e., labeled nulls) from $A$ occur. We are now ready to define the refinement of
relation $\prec$:

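Definition 10. Let $\Sigma$ be a set of constraints and $P \subseteq pos(\Sigma)$. For all $\alpha, \beta \in \Sigma$,
we define $\alpha \prec_P \beta$ iff there are tuples $\bar{a}, \bar{b}$ and a database instance $I_0$ such that

• $I_0 \xrightarrow{\alpha, \bar{a}} I_1$,

• $I_1 \not\models \beta(\bar{b})$,

• there is $n \in \bar{b} \cap \Delta_{null}$ in the head of $\beta(\bar{b})$ such that $\text{null-pos}(\{n\}, I_0) \subseteq P$,
and

• $I_0 \models \beta(\bar{b})$. □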

The refinement of -< forms the basis for the notion of a so-called restriction 
system, which is a strict generalization of the chase graph introduced in [9] and 
will serve as a central tool in our work. The two definitions below formalize 
restriction systems. 

Definition 11. For any set of positions $P$ and a TGD $\alpha$ let $\text{aff-cl}(\alpha, P)$ be the
set of positions $\pi$ from the head of $\alpha$ such that

• for every universally quantified variable $x$ in $\pi$: $x$ occurs in the body of $\alpha$
only in positions from $P$ or

• $\pi$ contains an existentially quantified variable.

□
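
Definition 11 translates directly into a small function, sketched below for the constraints of Example 10. The encoding is an assumption made for illustration: a TGD is a pair of atom lists, every term is a variable, and a position is a pair (relation, index); the name `aff_cl` mirrors the notation of the definition but is otherwise hypothetical.

```python
def aff_cl(tgd, P):
    """Positions of the head of tgd selected by Definition 11 for position set P."""
    body, head = tgd
    universal = {t for atom in body for t in atom[1:]}

    def body_positions(x):
        return {(a[0], i) for a in body for i, t in enumerate(a[1:], 1) if t == x}

    closure = set()
    for relation, index in {(a[0], i) for a in head for i in range(1, len(a))}:
        # all terms that some head atom places at this position
        terms = {a[index] for a in head if a[0] == relation}
        has_existential = any(t not in universal for t in terms)
        all_from_P = all(body_positions(t) <= P for t in terms if t in universal)
        if has_existential or all_from_P:
            closure.add((relation, index))
    return closure


# Example 10: alpha1 := S(x), E(x,y) -> E(y,x)
#             alpha2 := S(x), E(x,y) -> exists z E(y,z), E(z,x)
alpha1 = ([("S", "x"), ("E", "x", "y")], [("E", "y", "x")])
alpha2 = ([("S", "x"), ("E", "x", "y")], [("E", "y", "z"), ("E", "z", "x")])
E1, E2, S1 = ("E", 1), ("E", 2), ("S", 1)
print(aff_cl(alpha1, {E1, E2}), aff_cl(alpha2, set()))
```

With $P = \{E^1, E^2\}$ the closure of $\alpha_1$ contains only $E^1$, because $x$ also occurs in $S^1$; once $S^1$ is added to $P$, both head positions of $\alpha_1$ are selected.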



Definition 12. A 2-restriction system¹ is a pair $(G'(\Sigma), f)$, where $G'(\Sigma) :=
(\Sigma, E)$ is a directed graph and $f \subseteq pos(\Sigma)$ such that

• for all TGDs $\alpha$ and for all $(\alpha, \beta) \in E$: $\text{aff-cl}(\alpha, f) \cap pos(\Sigma) \subseteq f$ and
$\text{aff-cl}(\beta, f) \cap pos(\Sigma) \subseteq f$,

• for all $\alpha, \beta \in \Sigma$: $\alpha \prec_f \beta \Rightarrow (\alpha, \beta) \in E$.

A 2-restriction system is minimal if it is obtained from $((\Sigma, \emptyset), \emptyset)$ by a repeated
application of the conditions above (until all conditions hold) s.t., in case of
the first, $f$ is extended only by those positions that are required to satisfy the
condition. □

We illustrate this definition by two examples. The first one also shows that 
restriction systems always exist. 

Example 11. Let $\Sigma$ be a set of constraints. Then, $(G(\Sigma), pos(\Sigma))$ is a 2-restriction
system for constraint set $\Sigma$. □

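Example 12. Consider $\Sigma$ from Example 10. The minimal 2-restriction system
for $\Sigma$ is $G'(\Sigma) := (\Sigma, \{(\alpha_2, \alpha_1)\})$ with $f := \{E^1, E^2\}$; in particular, $\alpha_1 \not\prec_f \alpha_1$,
$\alpha_1 \not\prec_f \alpha_2$, $\alpha_2 \prec_f \alpha_1$ and $\alpha_2 \not\prec_f \alpha_2$ hold. □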

Restriction systems are useful tools to define new classes of constraints that 
guarantee chase termination. To give an example, one can show that the chase 
with a constraint set £ terminates for every database instance if every strongly 
connected component of its minimal 2-restriction system is safe. We refer the in- 
terested reader to [TS] for details, where this class was formally introduced under 
the name safe restriction. Note that the constraint set $\Sigma$ from Example 10 falls
into the class of safely restricted constraints, because its minimal 2-restriction
system (given in Example 12) contains no strongly connected component. In this
work, we skip the formal definition of safe restriction, but instead go one step 
further and define a termination condition called inductive restriction, which 
further generalizes safe restriction. The following example provides a constraint 
set that is not safely restricted but, as we shall see later, falls into the class of 
inductively restricted constraints. 

Example 13. We extend the constraint set from Example 12
to $\Sigma' := \Sigma \cup \{\alpha_3\}$, where $\alpha_3 := \exists x, y\, S(x), E(x, y)$. Then
$G'(\Sigma') := (\Sigma', \{(\alpha_1, \alpha_2), (\alpha_2, \alpha_1), (\alpha_3, \alpha_1), (\alpha_3, \alpha_2)\})$ with $f := \{E^1, E^2, S^1\}$ is the
minimal 2-restriction system. It contains the strongly connected component
$\{\alpha_1, \alpha_2\}$. Note that $\Sigma'$ is neither safe, nor stratified, nor safely restricted.
Hence, using the sufficient termination conditions discussed so far no chase
termination guarantees can be made for $\Sigma'$. □



¹ In [18,17,22] the notion of a 2-restriction system was simply called restriction system
and was defined slightly differently there.



part($\Sigma$: Set of TGDs and EGDs, k: natural number, k ≠ 1) {
    compute the strongly connected components (as
    sets of constraints) $C_1, \ldots, C_n$ of the minimal
    k-restriction system of $\Sigma$;
    $D \leftarrow \emptyset$;
    if (n == 1) then
        if ($C_1 \neq \Sigma$) then
            return part($C_1$, k);
        endif
        return {$\Sigma$};
    endif
    for i = 1 to n do
        $D \leftarrow D \cup$ part($C_i$, k);
    endfor
    return D; }

Fig. 7. Algorithm to compute subsets of $\Sigma$.
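
The recursion of Fig. 7 can be written down compactly once the computation of the strongly connected components is abstracted away. In the sketch below, `cyclic_components` is a hypothetical callback that returns the strongly connected components of the minimal k-restriction system of a constraint set; computing it is the NP part and is not shown. Following the way the term is used in Examples 12 and 14, the callback is assumed to return only components that contain a cycle. The lookup table stands in for the restriction systems of Examples 12 and 13.

```python
def part(sigma, k, cyclic_components):
    """Recursive decomposition of Fig. 7; returns a set of frozensets."""
    sigma = frozenset(sigma)
    components = [frozenset(c) for c in cyclic_components(sigma, k)]
    if len(components) == 1:
        if components[0] != sigma:
            return part(components[0], k, cyclic_components)
        return {sigma}
    result = set()
    for component in components:
        result |= part(component, k, cyclic_components)
    return result


def inductively_restricted(sigma, cyclic_components, is_safe):
    """Definition 13: every element of part(sigma, 2) must be safe."""
    return all(is_safe(s) for s in part(sigma, 2, cyclic_components))


# Assumed outcome of the restriction-system computation for Examples 12 and 13:
# Sigma' = {a1, a2, a3} has the single cyclic component {a1, a2}, and the
# minimal 2-restriction system of {a1, a2} contains no cycle.
table = {
    frozenset({"a1", "a2", "a3"}): [{"a1", "a2"}],
    frozenset({"a1", "a2"}): [],
}


def lookup(sigma, k):
    return table[sigma]


print(part({"a1", "a2", "a3"}, 2, lookup))
```

On this input the result is the empty set, in line with Example 14, so the safety test of Definition 13 holds vacuously.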



Intuitively, in the example above the constraint $\alpha_3$ "infects" position $S^1$ in the
2-restriction system. Still, null values cannot be repeatedly created in $S^1$: $\alpha_3$
fires at most once, so it does not affect chase termination. Our novel
termination condition resolves such situations by recursively computing the minimal
2-restriction systems of the strongly connected components. We formalize this
computation in Algorithm 1, called $\text{part}(\Sigma, 2)$, and define the class of inductively
restricted constraint sets with the help of this algorithm.

Definition 13. Let $\Sigma$ be a set of constraints. We call $\Sigma$ inductively restricted
iff every $\Sigma' \in \text{part}(\Sigma, 2)$ is safe. □

Compared to stratification, inductive restriction does not increase the complexity 
of the recognition problem: 

Lemma 2. Let $\Sigma$ be a set of constraints. The recognition problem for inductive
restriction is in coNP. □

Proof Sketch. We start with an additional claim: let $P$ be a set of positions and
$\alpha, \beta$ constraints. Then, the mapping $(P, \alpha, \beta) \mapsto \alpha \prec_P \beta\,?$ can be computed by
an NP-algorithm. The proof of this claim proceeds like the proof of Theorem 3
in [9]. It is enough to consider candidate databases for $I_0$ of size at most $|\alpha| + |\beta|$,
i.e. unions of homomorphic images of the premises of $\alpha$ and $\beta$ s.t. null values
occur only in positions from $P$. Because of this claim, the minimal 2-restriction
system of a set of constraints can be computed by an NP-algorithm (only
polynomially many steps must be performed to reach the fixedpoint). Computing
$\text{part}(\Sigma, 2)$ can also be done in non-deterministic polynomial time. To prove that
$\Sigma$ is not inductively restricted, guess some $\Sigma' \in \text{part}(\Sigma, 2)$ and verify that it is
not safe. □

We give an example of an inductively restricted constraint set which, as argued
in Example 13, is neither safe nor stratified.



Example 14. Referring back to Example 13, we have seen that the minimal
2-restriction system of $\Sigma'$ contains the only strongly connected component $\{\alpha_1, \alpha_2\}$,
which by Example 10 is not safe. Therefore we compute the minimal 2-restriction
system of $\{\alpha_1, \alpha_2\}$ and see that it does not contain a cycle. This argumentation
proves that $\text{part}(\Sigma', 2) = \emptyset$, so we conclude that constraint set $\Sigma'$ is inductively
restricted. □

As depicted in Figure [TJ the inductive restriction condition generalizes both 
safety and weak acyclicity. The following proposition formally states these results 
and shows that the respective inclusion relationships are proper. Please note that, 
as we will show later inductive restriction ensures chase termination independent 
of the database and the chase sequence, therefore it cannot extend stratification, 
see example |H 

Proposition 2. The following claims hold.

• If $\Sigma$ is safe, then it is inductively restricted.

• There is some $\Sigma$ that is stratified, but not inductively restricted.

• There is some $\Sigma$ that is inductively restricted, but neither safe nor
c-stratified. □

Proof Sketch. We start with bullet one. It follows from the fact that every 
subset of a safe constraint set is safe. Bullet two follows from example 2] Finally, 
bullet three is proven by the constraint set from Examples Q2] and Q3] □ 

The next theorem gives the main result concerning inductive restriction, showing 
that it guarantees chase termination in polynomial time data complexity. We 
refer the interested reader to theorem [7] for a formal proof of this theorem. 

Theorem 6. Let $\Sigma$ be a fixed set of inductively restricted constraints. Then,
there exists a polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$, the
length of every chase sequence is bounded by $Q(|dom(I)|)$. □

We conclude with the remark that our motivating constraint set from Figure 2
is not inductively restricted: the constraint $\alpha$ can cause itself to fire, so its
minimal 2-restriction system contains an edge from $\alpha$ to $\alpha$, which forms a strongly
connected component; further, $\alpha$ is not safe. To show that the chase with $\alpha$
terminates, we need weaker termination conditions than inductive restriction.



3.6 The T-Hierarchy 

This section introduces the T-hierarchy, which is our main result regarding data- 
independent chase termination. Its lowest level, T[2], corresponds to inductive 
restriction. Every level in the hierarchy is decidable and contains all lower levels. 
As we shall see, also the constraint from Figure [2] is a member of some level 
in this hierarchy. In the course of this section we leave out some proofs for 
space limitations, referring the interested reader to the technical report [H] . We 
start by defining the $k$-ary relation $\prec_{k,P}$, which generalizes $\prec_P$. The
definition naturally extends the $\prec_P$ relation to a fixed number $k$ of constraints.



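Definition 14. Let $k \geq 2$, $\Sigma$ a set of constraints and $P \subseteq pos(\Sigma)$. For all
$\alpha_1, \ldots, \alpha_k \in \Sigma$, we define $\prec_{k,P}(\alpha_1, \ldots, \alpha_k)$ iff there are tuples
$\bar{a}_1, \ldots, \bar{a}_k$ and a database instance $I_0$ such that

• for all $i \in [k-1]$ it holds that $I_{i-1} \xrightarrow{*,\alpha_i,\bar{a}_i} I_i$,

• $I_{k-1} \not\models \alpha_k(\bar{a}_k)$,

• there is $n \in \bar{a}_k \cap \Delta_{null}$ in the head of $\alpha_k(\bar{a}_k)$ such that $null\text{-}pos(\{n\}, I_0) \subseteq P$,

• $I_0 \models \alpha_k(\bar{a}_k)$, and

• for every $i \in [k-1]$ it holds that $J_{k-1}$ is defined and $J_{k-1} \models \alpha_k(\bar{a}_k)$, where
$J_0 := I_0$, $J_{l-1} \xrightarrow{*,\alpha_l,\bar{a}_l} J_l$ for $k > l \neq i$ and $J_i := J_{i-1}$. □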

Note that $\prec_{2,P}$ corresponds exactly to $\prec_P$ introduced in Definition 10. It can
be shown that, for a fixed value of $k$, membership in this relation is decidable in
NP:

Proposition 3. Let $k \geq 2$ be fixed. Then there exists an NP-algorithm that
decides for every set of positions $P$ and every $\alpha_1, \ldots, \alpha_k \in \Sigma$ whether
$\prec_{k,P}(\alpha_1, \ldots, \alpha_k)$ holds. □

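Proof: Let $k \geq 2$ be fixed and $\alpha_1, \ldots, \alpha_k$ be TGDs and EGDs. Assume
$\prec_{k,P}(\alpha_1, \ldots, \alpha_k)$. Choose a database instance $I_0$ and sequences $\bar{a}_1, \ldots, \bar{a}_k$ such
that the definition of $\prec_{k,P}(\alpha_1, \ldots, \alpha_k)$ holds. For all $i \in [k-1]$ set
$I_{i-1} \xrightarrow{*,\alpha_i,\bar{a}_i} I_i$ and $I_{k-1} \xrightarrow{*,\alpha_k,\bar{a}_k} I_k$.
There is a sequence of homomorphisms $h_1, \ldots, h_k$ such that
$h_i : body(\alpha_i) \rightarrow I_{i-1}$ for all $i \in [k]$. Let $J_0 \subseteq I_0$ be the minimal subinstance
(with respect to set cardinality) such that for all $i \in [k]$,
$J_{i-1} \xrightarrow{*,\alpha_i,\bar{a}_i} J_i$ and $J_i \subseteq I_i$.
Then $J_0$ and $\bar{a}_1, \ldots, \bar{a}_k$ also satisfy the conditions from the definition
of $\prec_{k,P}(\alpha_1, \ldots, \alpha_k)$. Furthermore, it must hold that
$|J_0| \leq \sum_{i \in [k]} (|body(\alpha_i)| + |head(\alpha_i)|) \leq \sum_{i \in [k]} |\alpha_i|$,
where $|\alpha_i|$ denotes the length of the sequence of symbols of the formula $\alpha_i$.
So only finitely many candidate databases, each of size bounded by the length
of the input, have to be examined, which completes the proof. □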

We next use the relation $\prec_{k,P}$ to define $k$-restriction systems, which naturally
generalize the 2-restriction systems defined over relation $\prec_P$ (cf. Definition 13).

Definition 15. Let $k \in \mathbb{N}_{>1}$. A $k$-restriction system $G'_k(\Sigma)$ is a pair $(G', f)$,
where $G' = (\Sigma, E)$ is a graph and $f \subseteq pos(\Sigma)$ such that

• for all TGDs $\alpha$ and for all $(\alpha, \beta) \in E$: $aff\text{-}cl(\alpha, f) \subseteq f$ and

• for all $\alpha_1, \ldots, \alpha_k \in \Sigma$: if $\prec_{k,f}(\alpha_1, \ldots, \alpha_k)$ then $(\alpha_1, \alpha_2), \ldots, (\alpha_{k-1}, \alpha_k) \in E$.

A $k$-restriction system is minimal if it is obtained from $((\Sigma, \emptyset), \emptyset)$ by a repeated
application of the constraints from bullets one and two (until all constraints hold)
such that, in case of the first bullet, $f$ is extended only by those positions that
are required to satisfy the condition. In case the second bullet is applied, $E$ is
extended. □
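
The fixpoint construction behind minimal $k$-restriction systems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `prec` (an oracle for $\prec_{k,f}$), `aff_cl` (the closure $aff\text{-}cl$) and `is_tgd` are hypothetical callbacks supplied by the caller, and all the hard work (the NP test) is hidden inside `prec`. Constraints and positions are represented by plain strings.

```python
from itertools import product

def minimal_restriction_system(sigma, k, prec, aff_cl, is_tgd):
    """Grow (E, f) from (empty, empty) until both bullets of the definition hold.

    prec(k, f, tup) -> bool  : hypothetical oracle for the k-ary relation
    aff_cl(alpha, f) -> set  : hypothetical closure of affected positions
    is_tgd(alpha)   -> bool  : hypothetical test whether alpha is a TGD
    """
    edges, f = set(), set()
    changed = True
    while changed:
        changed = False
        # Bullet one: extend f only by the positions the condition requires.
        for (alpha, _beta) in list(edges):
            if is_tgd(alpha):
                missing = set(aff_cl(alpha, frozenset(f))) - f
                if missing:
                    f |= missing
                    changed = True
        # Bullet two: every related k-tuple contributes its consecutive pairs.
        for tup in product(sigma, repeat=k):
            if prec(k, frozenset(f), tup):
                for edge in zip(tup, tup[1:]):
                    if edge not in edges:
                        edges.add(edge)
                        changed = True
    return edges, f

# Toy oracles, purely for demonstration.
toy_edges, toy_f = minimal_restriction_system(
    ["a", "b"], 2,
    prec=lambda k, f, tup: tup == ("a", "b"),
    aff_cl=lambda alpha, f: {"R^1"} if alpha == "a" else set(),
    is_tgd=lambda alpha: True,
)
```

The loop enumerates all $|\Sigma|^k$ tuples in every round, which is polynomial only because $k$ is fixed; it terminates as long as the oracles draw from a finite set of positions, since $E$ and $f$ only grow.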

Note that for $k = 2$ this definition corresponds exactly to the definition of
2-restriction systems used to define inductive restriction. Like 2-restriction
systems, minimal $k$-restriction systems are unique and can be computed by an
NP-algorithm:



Proposition 4. Let $k \geq 2$ be fixed and $\Sigma$ a set of constraints. The minimal
$k$-restriction system for $\Sigma$ is unique and can be computed by an NP-algorithm. □

Proof: Uniqueness follows directly from the definition: the computation is
monotone and bounded. The computation takes polynomially many steps and
each step requires at most one guess whether $\prec_{k,f}(\alpha_1, \ldots, \alpha_k)$ holds. Clearly, this
algorithm runs in non-deterministic polynomial time. □

We are now in the position to define the T-hierarchy: 

Definition 16. Let $k \geq 2$ and $\Sigma$ be a set of constraints. Then $\Sigma \in T[k]$ iff there
is $k' \in [k] \setminus \{1\}$ such that for every $\Sigma' \in part(\Sigma, k')$ it holds that $\Sigma'$ is safe. □

We call $T[k]$ the $k$-th level of the T-hierarchy. As a corollary of Proposition 4
we obtain that we can decide whether a set of constraints is in $T[k]$ by a coNP-
algorithm. We next give an example for constraints in the T-hierarchy.

Example 15. We set $\Sigma_{k+1} := \{\alpha_{k+1}\}$, where $\alpha_{k+1} :=
S(x_{k+1}), R_k(x_1, \ldots, x_{k+1}) \rightarrow \exists y\, R_k(y, x_1, \ldots, x_k)$. It holds that
$\prec_k(\alpha_{k+1}, \ldots, \alpha_{k+1})$ but not $\prec_{k+1}(\alpha_{k+1}, \ldots, \alpha_{k+1})$. So the minimal
$(k+1)$-restriction system does not contain
any cycle, but the minimal $k$-restriction system does. Therefore $\Sigma_{k+1} \in T[k+1]$.
On the other hand, we observe that the constraint is not safe, so it is not
contained in $T[k]$. Also note that the constraint in Figure 2 exactly corresponds
to $\Sigma_2$, so it is contained in level $T[3]$. □

The following proposition relates the levels of the T-hierarchy to each other and 
inductive restriction. 

Proposition 5. Let $k \geq 2$.

• $\Sigma$ is inductively restricted iff $\Sigma \in T[2]$.

• $T[k] \subseteq T[k+1]$.

• There is some $\Sigma$ such that $\Sigma \in T[k+1] \setminus T[k]$. □

Proof Sketch. (a) To prove bullet one, note that both definitions coincide
exactly. (b) Bullet two follows by definition. (c) For bullet three we refer back
to Example 15. □

The next result is our main contribution concerning data-independent chase 
termination. It states that, for a fixed value of k, membership in T[k] guarantees 
polynomial time data complexity for the chase. Again, the technical proof can 
be found in [IB"] . 

Theorem 7. Let $k \geq 2$ and $\Sigma \in T[k]$ be a fixed set of constraints. Then there
exists a polynomial $Q \in \mathbb{N}[X]$ such that for any database instance $I$, the length
of every chase sequence is bounded by $Q(|dom(I)|)$. □



Proof Sketch: Let $\Sigma$ be the set of constraints under consideration. Let
$C_1, \ldots, C_n$ be the strongly connected components of the minimal $k$-restriction
system of $\Sigma$. We will show the following lemma.

Lemma 3. If the chase terminates data-independently for every strongly
connected component of the minimal $k$-restriction system of $\Sigma$, then it terminates
data-independently for $\Sigma$. □

Proof Sketch. Assume that we have a database instance $I_0$ such that the chase
does not terminate. We will construct an infinite chase sequence that uses only
constraints from some of the $C_i$.

We have an infinite chase sequence $S = I_0 \xrightarrow{\alpha_1,\bar{a}_1} I_1 \xrightarrow{\alpha_2,\bar{a}_2} \ldots$. Without loss of
generality, we can assume that

1. every constraint from $\Sigma$ fires infinitely often,

2. for every $j \in \mathbb{N}$ there is some $i > j$ such that $J_{i-1} \models \alpha_i(\bar{a}_i)$, where
$J_0 := I_0$, $J_{l-1} \xrightarrow{\alpha_l,\bar{a}_l} J_l$ for $l \neq j$ and $J_j := J_{j-1}$, and

3. for all $i \in \mathbb{N}$ we have that $(dom(I_i) \setminus dom(I_{i-1})) \cap \Delta = \emptyset$.

This infinite chase sequence will serve as a witness for the fact that some strongly
connected component of the minimal $k$-restriction system already has an infinite
chase sequence.

For every $i \in \mathbb{N} \setminus [k]$ we do the following. Let $n$ be a null value of level $i$ such that
$n \in dom(I_h) \setminus dom(I_{h-1})$. So $n$ was introduced in chase step $h$, which must be
due to an application of a TGD. W.l.o.g. there is a chase step $h' > h$ minimal
such that $\alpha_{h'}(\bar{a}_{h'})$ is violated, where $n \in \bar{a}_{h'}$ and $n$ appears in $head(\alpha_{h'})(\bar{a}_{h'})$.
Then we find $\beta_1, \ldots, \beta_{k-2}$ such that $\prec_{k,f}(\beta_1, \ldots, \beta_{k-2}, \alpha_h, \alpha_{h'})$. W.l.o.g. there
is a chase step $h'' > h'$ minimal such that $\alpha_{h''}(\bar{a}_{h''})$ is violated, where
$n \in \bar{a}_{h''}$ and $n$ appears in $head(\alpha_{h''})(\bar{a}_{h''})$. Then we find $\beta_1, \ldots, \beta_{k-3}$ such that
$\prec_{k,f}(\beta_1, \ldots, \beta_{k-3}, \alpha_h, \alpha_{h'}, \alpha_{h''})$. Iterating this procedure, we obtain a subset of $\Sigma$'s
minimal $k$-restriction system. Considering its strongly connected components, we
observe that every such component is contained in some $C_i$ due to monotonicity
of the construction of a minimal restriction system. Thus, there must be some
$i_0 \in [n]$ for which we have an infinite chase sequence starting with the instance
$I_0$. □

The theorem follows from a repeated application of the lemma. The polynomial 
time data complexity follows from the fact that fc-restriction systems bound the 
depth of null values (see [TT]) data-independently. □ 

3.7 An Algorithmic Approach 

This section aims to develop an efficient algorithm to test membership in $T[k]$.
We have seen before that the computation of $k$-restriction systems is costly
because we need NP time to compute the relation $\prec_{k,P}$. For this reason, we
present an algorithm that avoids the computation of $k$-restriction systems where
possible. It relies on the idea that (the weaker condition) safety can be checked in


sub(Σ: set of TGDs and EGDs, k: natural number, k ≠ 1) {
    if (Σ is safe) then
        return true;
    endif
    compute the strongly connected components (as sets of constraints)
        C_1, ..., C_n of the minimal k-restriction system of Σ;
    if (n == 0) then
        return true;
    endif
    if (n == 1) then
        if (C_1 ≠ Σ) then
            return check(C_1, k);
        endif
        return false;
    endif
    for i = 1 to n do
        if (not check(C_i, k)) then
            return false;
        endif
    endfor
    return true; }

check(Σ: set of TGDs and EGDs, k: natural number, k ≠ 1) {
    for i = k downto 2 do
        if (sub(Σ, i)) then return true;
    endfor
    return false; }

Fig. 8. Algorithm to decide membership in T[·].



polynomial time (cf. Section [23). Before computing the fc-restriction system, we 
always check for safety and, whenever safety holds, we conclude that the chase 
for the respective constraint set terminates and omit the fc-restriction system 
computation. 

To give a simple example, consider the constraint from Example |H1 which has 
been shown to be safe, and assume we want to test if it falls into some (fixed) level 
k of the T-hierarchy. Computing a fc-restriction system is superfluous, because 
membership in T[k] trivially follows from the satisfaction of the safety condition. 
In general, the situation is, of course, not that simple. Consider for instance
the constraint set $\Sigma'$ from Example 13 extended by $\{\alpha_4, \alpha_5\}$, where $\alpha_4 :=
E(x_1, x_2) \rightarrow T(x_1, x_2)$, $\alpha_5 := T(x_1, x_2) \rightarrow T(x_2, x_1)$, and call the resulting
constraint set $\Sigma''$. Assume we want to show that $\Sigma''$ is inductively
restricted (i.e., in $T[2]$). It follows from Example 10 that $\Sigma''$ is not safe. In
direct correspondence to Example 13 it follows that the minimal 2-restriction
system for $\Sigma''$ is $G'(\Sigma'') := (\Sigma'', \{(\alpha_1, \alpha_2), (\alpha_2, \alpha_1), (\alpha_3, \alpha_1), (\alpha_3, \alpha_2), (\alpha_1, \alpha_4),
(\alpha_2, \alpha_4), (\alpha_4, \alpha_5), (\alpha_5, \alpha_5)\})$, where $f(\alpha_1) = f(\alpha_2) := \{E^1, E^2, S^1\}$, $f(\alpha_3) := \emptyset$, $f(\alpha_4)$



Sample Schema:   hasAirport(c_id)
                 fly(c_id1, c_id2, dist)
                 rail(c_id1, c_id2, dist)

Constraint Set:  Σ = {α1, α2, α3}, where

  α1  If there is a flight connection between two cities,
      both of them have an airport:
      fly(c1, c2, d) → hasAirport(c1), hasAirport(c2)

  α2  Rail connections are symmetrical:
      rail(c1, c2, d) → rail(c2, c1, d)

  α3  Each city that is reachable via plane has at
      least one outgoing flight scheduled:
      fly(c1, c2, d) → ∃c3, d' fly(c2, c3, d')

Fig. 9. Sample database schema and constraints.



$:= \{E^1, E^2\}$ and $f(\alpha_5) := \{T^1, T^2\}$. This 2-restriction system contains the strongly
connected components $\{\alpha_1, \alpha_2\}$ and $\{\alpha_5\}$. For $\{\alpha_1, \alpha_2\}$ we must compute its
minimal 2-restriction system because it is not safe, but for $\{\alpha_5\}$ we can avoid
this complexity because we know that $\alpha_5$ is safe (indeed it is a full TGD) and
therefore the chase terminates. We implement the scheme described above in
algorithm check, provided in Figure 8.

Proposition 6. Algorithm check terminates and correctly decides membership
in the T-hierarchy, i.e. check($\Sigma$, $k$) returns true if and only if $\Sigma \in T[k]$. □

Proof Sketch. The algorithm terminates because all recursive calls are made
on constraint sets that are smaller than the input constraint set. The
algorithm tries to avoid the computation of $k$-restriction systems
by testing for safety. The correctness follows from the proof of Theorem 7
because the only property we need to show is that for all $\Sigma' \in part(\Sigma, k)$ the
chase terminates, which is ensured by the additional safety checks. □
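
The mutual recursion of sub and check can be sketched in Python as follows. This is a sketch under stated assumptions, not a full implementation: `is_safe` and `sccs` are hypothetical callbacks standing in for the polynomial safety test and for the (NP-hard) computation of the strongly connected components of the minimal $k$-restriction system. The toy stubs below mimic the constraint set $\Sigma''$ discussed above, with constraints represented by strings.

```python
def make_check(is_safe, sccs):
    """Build check(sigma, k) from two hypothetical oracles.

    is_safe(sigma) -> bool            : safety test
    sccs(sigma, k) -> list[frozenset] : SCCs of the minimal k-restriction system
    """
    def sub(sigma, k):
        if is_safe(sigma):
            return True                      # safety makes the SCC computation unnecessary
        comps = sccs(sigma, k)
        if len(comps) == 0:
            return True                      # no cycle at all
        if len(comps) == 1:
            if comps[0] != sigma:
                return check(comps[0], k)    # recurse on a strictly smaller set
            return False                     # no progress possible at this level
        return all(check(c, k) for c in comps)

    def check(sigma, k):
        # Try the levels k, k-1, ..., 2 in turn.
        return any(sub(sigma, i) for i in range(k, 1, -1))

    return check

# Toy oracles mirroring the running example (assumed values, for illustration).
SIGMA2 = frozenset({"a1", "a2", "a3", "a4", "a5"})

def toy_is_safe(sigma):
    return not {"a1", "a2"} <= sigma         # unsafe exactly when a1 and a2 are both present

def toy_sccs(sigma, k):
    if sigma == SIGMA2:
        return [frozenset({"a1", "a2"}), frozenset({"a5"})]
    return []                                # {a1, a2} alone yields no cycle

toy_check = make_check(toy_is_safe, toy_sccs)
```

With these stubs, `toy_check(SIGMA2, 2)` follows the path described in the text: the component containing only a5 is accepted by the safety test alone, and only the component {a1, a2} triggers a second call to the restriction-system oracle.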



4 Data-dependent Termination 

So far, we discussed conditions that guarantee chase termination for every
database instance. In this section, we study the problem of data-dependent
termination, i.e. given a constraint set $\Sigma$ and a fixed instance $I$, does the chase with
$\Sigma$ terminate on $I$? To the best of our knowledge, this problem has not been
studied before. Therefore, we start our discussion with a motivating scenario. Let us
consider the travel agency database in Figure 9, where predicate hasAirport
contains cities that have an airport and fly (rail) stores flight (rail) connections
between cities, including their distance dist. In addition to the schema,
constraints $\alpha_1$-$\alpha_3$ have been specified. For instance, $\alpha_3$ might have been added to
assert that, for each city reachable via plane, the schedule is integrated in the
local database. Now consider the CQ $q_1$ below.

$q_1$: $rf(x_2) \leftarrow rail(c_1, x_1, y_1), fly(x_1, x_2, y_2)$

The query selects all cities that can be reached from $c_1$ through rail-and-fly.
Assume that, in the style of semantic query optimization, we want to optimize
$q_1$ under constraints $\Sigma$ using the chase. We then interpret the body of $q_1$ as
database instance $I := \{rail(c_1, x_1, y_1), fly(x_1, x_2, y_2)\}$, where $c_1$ is a constant
and the $x_i$, $y_i$ are labeled nulls. We observe that $\alpha_3$ does not hold on $I$, since there
is a flight to city $x_2$, but no outgoing flight from $x_2$. Hence, the chase adds a
new tuple $t_1 := fly(x_2, x_3, y_3)$ to $I$, where $x_3$, $y_3$ are fresh labeled null values. In
the resulting instance $I' := I \cup \{t_1\}$, $\alpha_3$ is again violated (this time for $x_3$) and
in subsequent steps the chase adds $fly(x_3, x_4, y_4)$, $fly(x_4, x_5, y_5)$, $fly(x_5, x_6, y_6)$,
.... Obviously, it will never terminate.
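
The runaway behaviour is easy to reproduce. The following Python sketch chases the body of $q_1$ with $\alpha_3$ only and stops after a step budget; the tuple encoding, the function name `chase_alpha3` and the naming scheme for fresh nulls are assumptions made for illustration.

```python
from itertools import count

def chase_alpha3(instance, max_steps):
    """Chase with alpha_3: fly(c1, c2, d) -> exists c3, d'. fly(c2, c3, d').

    Returns (instance, True) if alpha_3 holds, or (instance, False) if the
    step budget is used up while alpha_3 is still violated.
    """
    inst = set(instance)
    fresh = count(3)                          # fresh nulls x3, y3, x4, y4, ...
    steps = 0
    while True:
        sources = {t[1] for t in inst if t[0] == "fly"}
        # fly tuples whose destination has no outgoing flight violate alpha_3
        violated = sorted(t for t in inst if t[0] == "fly" and t[2] not in sources)
        if not violated:
            return inst, True
        if steps == max_steps:
            return inst, False
        i = next(fresh)
        inst.add(("fly", violated[0][2], f"x{i}", f"y{i}"))
        steps += 1

Q1_BODY = {("rail", "c1", "x1", "y1"), ("fly", "x1", "x2", "y2")}
result, finished = chase_alpha3(Q1_BODY, max_steps=5)
```

Whatever budget is chosen, `finished` stays false for this instance: every repair introduces a new destination city that itself lacks an outgoing flight.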

Arguably, reasonable applications should never risk non-termination. It is clear,
though, that the existence of (even a single) non-terminating chase sequence
also means that no data-independent termination condition holds. Hence, based
on data-independent conditions no query at all could be safely chased with the
constraint set from Figure 9 and the benefit of the chase algorithm would be
completely lost. Despite the fact that there is a non-terminating chase sequence,
however, there might be queries for which the chase with the constraint set from
Figure 9 terminates. Tackling such situations, we propose to investigate
data-dependent chase termination, i.e. to study sufficient termination guarantees for
a fixed instance when no general termination guarantees apply. We illustrate the
benefits of having such guarantees for query $q_2$ below, which selects all cities $x_2$
that can be reached from $c_1$ via rail-and-fly and for which the same transport route leads
back from $x_2$ to $c_1$ (where $c_1$ is a constant and the $x_i$, $y_i$ are variables).

$q_2$: $rffr(x_2) \leftarrow rail(c_1, x_1, y_1), fly(x_1, x_2, y_2),$
$fly(x_2, x_1, y_2), rail(x_1, c_1, y_1)$

Query $q_2$ violates only $\alpha_1$. It is easy to verify that the chase terminates for this
query and transforms $q_2$ into $q'_2$:

$q'_2$: $rffr(x_2) \leftarrow rail(c_1, x_1, y_1), fly(x_1, x_2, y_2),$
$fly(x_2, x_1, y_2), rail(x_1, c_1, y_1),$
$hasAirport(x_1), hasAirport(x_2)$
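
The claim can be checked mechanically. The sketch below hard-codes the three constraints of the travel agency example and chases the body of $q_2$; the tuple encoding, the name `chase_travel`, the order in which violations are repaired, and the names of fresh nulls are assumptions made for this illustration.

```python
def chase_travel(instance, max_steps=50):
    """Chase with alpha_1, alpha_2, alpha_3 of the travel agency example.

    Returns (instance, True) once no constraint is violated, or
    (instance, False) if max_steps repairs were not enough.
    """
    inst = set(instance)
    steps = 0
    while True:
        new = None
        for t in sorted(inst):
            if t[0] == "fly":
                _, src, dst, _dist = t
                missing = [c for c in (src, dst) if ("hasAirport", c) not in inst]
                if missing:                                   # alpha_1 violated
                    new = ("hasAirport", missing[0])
                elif not any(u[0] == "fly" and u[1] == dst for u in inst):
                    new = ("fly", dst, f"n{steps}", f"m{steps}")  # alpha_3 violated
            elif t[0] == "rail" and ("rail", t[2], t[1], t[3]) not in inst:
                new = ("rail", t[2], t[1], t[3])              # alpha_2 violated
            if new is not None:
                break
        if new is None:
            return inst, True
        if steps == max_steps:
            return inst, False
        inst.add(new)
        steps += 1

Q2_BODY = {
    ("rail", "c1", "x1", "y1"), ("fly", "x1", "x2", "y2"),
    ("fly", "x2", "x1", "y2"), ("rail", "x1", "c1", "y1"),
}
q2_chased, q2_done = chase_travel(Q2_BODY)
```

For this input the chase stops after two repairs, both of them applications of $\alpha_1$, and the result is the body of $q_2$ extended by the two hasAirport atoms.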

The resulting query q' 2 satisfies all constraints and is a so-called universal plan pQ : 
intuitively, it incorporates all possible ways to answer the query. As discussed 
in pp, the universal plan forms the basis for finding smaller equivalent queries 
(under the respective constraints), by choosing any subquery of $q'_2$ and testing
if it can be chased to a homomorphic copy of $q'_2$. Using this technique we can
easily show that the following two queries are equivalent to $q_2$.

2 Note that, principally, query optimization could also be done with a bounded portion 
of the chase result, but in general we do not find minimal rewritings of the input 
query in the style of pQ. Therefore, it is desirable to guarantee chase termination. 



$q''_2$: $rffr(x_2) \leftarrow rail(c_1, x_1, y_1), fly(x_1, x_2, y_2),$
$fly(x_2, x_1, y_2)$

$q'''_2$: $rffr(x_2) \leftarrow hasAirport(x_1), rail(c_1, x_1, y_1),$
$fly(x_1, x_2, y_2), fly(x_2, x_1, y_2)$

Instead of $q_2$ we thus could evaluate $q''_2$ or $q'''_2$, which might well be more
performant: in both $q''_2$ and $q'''_2$ the join with $rail(x_1, c_1, y_1)$ has been eliminated;
moreover, if hasAirport is duplicate-free, the additional join of rail with hasAirport
in $q'''_2$ may serve as a filter that decreases the size of intermediate results and
speeds up query evaluation. This strategy is called join introduction in SQO
(cf. [13]). Ultimately, the chase for $q_2$ made it possible to detect $q''_2$ and $q'''_2$, so
it would be desirable to have data-dependent termination guarantees that allow
us to chase $q_2$ (and $q''_2$, $q'''_2$). We will present such conditions in the remainder of
this section.



4.1 Static Termination Guarantees 

Our first approach to data-dependent chase termination is a static one. It relies
on the observation that the chase will always terminate on instance $I$ if the
subset of constraints that might fire when chasing $I$ with $\Sigma$ is contained in some
level of the T-hierarchy. We call a constraint $\alpha \in \Sigma$ $(I, \Sigma)$-irrelevant if and
only if there is no chase sequence such that $\alpha$ can eventually fire, i.e. no chase
sequence of the form $I \xrightarrow{\alpha_1,\bar{a}_1} \cdots \xrightarrow{\alpha,\bar{a}} \cdots$.

Lemma 4. Let $k \geq 2$ and $\Sigma' \subseteq \Sigma$ s.t. $\Sigma \setminus \Sigma'$ is a set of $(I, \Sigma)$-irrelevant
constraints. If $\Sigma' \in T[k]$, then the chase with $\Sigma$ terminates for instance $I$. □

Proof Sketch. It holds that $\Sigma'$ contains all constraints that may fire during
the execution of the chase starting with $I$ and $\Sigma$. $I^{\Sigma'}$ is finite and $I^{\Sigma} = I^{\Sigma'}$. □

Hence, the crucial point is to effectively compute the set of $(I, \Sigma)$-irrelevant
constraints. Unfortunately, it turns out that checking $(I, \Sigma)$-irrelevance is an
undecidable problem in general:

Theorem 8. Let $\Sigma$ be a set of constraints, $\alpha \in \Sigma$ a constraint, and $I$ an
instance. It is undecidable if $\alpha$ is $(I, \Sigma)$-irrelevant. □

Proof Sketch. It is well-known that the following problem is undecidable: given
a Turing machine $M$ and a state transition $t$ from the description of $M$,
does $M$ reach $t$ (given the empty string as input)? From $(M, t)$, we will compute
a set of TGDs and EGDs $\Sigma_M$ and a TGD $\alpha_t \in \Sigma_M$ such that the following
equivalence holds: $M$ reaches $t$ (given the empty string as input) $\Leftrightarrow$ there is a
chase sequence in the computation of the chase with $\Sigma_M$ applied to the empty
instance such that $\alpha_t$ will eventually fire.

Our reduction uses the construction in the proof of Theorem 1 in [5]. To be
self-contained, we review it here again. We use the signature consisting of the
relation symbols: $T(x, a, y)$ tape "horizontal" edge from $x$ to $y$ with symbol $a$;
$H(x, s, y)$ head "horizontal" edge from $x$ to $y$ with state $s$; $L(x, y)$ left "vertical"
edge; $R(x, y)$ right "vertical" edge; $A_\delta(x)$, $B_\delta(x)$ for every state transition $\delta$; one
constant for every tape symbol; one constant for every head state; the special
constant $B$ marking the beginning of the tape; and □ to denote an empty tape
cell. The set of constraints $\Sigma_M$ is as follows.

1. The initial configuration:

$\exists w, x, y, z\; T(w, B, x), T(x, □, y), H(x, s_0, y), T(y, E, z)$

where □ is the blank symbol and $s_0$ is the initial state (both are constants).

2. For every state transition $\delta$ which moves the head to the right, replacing
symbol $a$ with $a'$ and going from state $s$ to state $s'$:

$T(x, a, y), H(x, s, y), T(y, b, z) \rightarrow$
$\exists x', y', z'\; L(x, x'), R(y, y'), R(z, z'), T(x', a', y'),$
$T(y', b, z'), H(y', s', z'), A_\delta(w')$.

Here $a$, $s$, $a'$, $b$, and $s'$ are constants.

3. For every state transition $\delta$ which moves the head to the right past the end
of the tape, replacing symbol $a$ with $a'$ and going from state $s$ to state $s'$:

$T(x, a, y), H(x, s, y), T(y, E, z) \rightarrow$
$\exists w', x', y', z'\; L(x, x'), R(y, y'), R(z, z'), T(x', a', y'),$
$T(y', □, z'), H(y', s', z'), T(y', E, w'), A_\delta(w')$.

Here $a$, $s$, $a'$, $b$, and $s'$ are constants.

4. Similarly for state transitions which move the head to the left.

5. Similarly for state transitions which do not move the head.

6. For every state transition $\delta$:

$A_\delta(x) \rightarrow B_\delta(x)$

7. Left copy:

$T(x, a, y), L(y, y') \rightarrow \exists x'\; L(x, x'), T(x', a, y')$.

Here $a$ is a constant.

8. Right copy:

$T(x, a, y), R(x, x') \rightarrow \exists y'\; T(x', a, y'), R(y, y')$.

Here $a$ is a constant.

The state transition $t$ is transformed to $\alpha_t$ in the same way as in bullet six
above. It is crucial to the proof that every state transition $\delta$ in $M$ is represented
as a single TGD $A_\delta(x) \rightarrow B_\delta(x)$. The constraint for the initial configuration
fires exactly once. The computation of the chase with this set of constraints
can be understood as a grid in which each row represents a configuration
of the Turing machine. It can be shown that $(M, t)$ is a yes-instance if
and only if $(\Sigma_M, \alpha_t)$ is a yes-instance. Thus, the equivalence from above holds. □

This result prevents us from computing the minimal set of constraints that may
fire when chasing $I$. Still, we can give sufficient conditions that guarantee
$(I, \Sigma)$-irrelevance for a constraint. For this purpose, we use the chase graph.

Proposition 7. Let $I$ be an instance and $\Sigma$ be a set of constraints such that
every constraint has a non-empty body. Further let $\alpha_I := \exists \bar{x} \bigwedge_{R(\bar{x}_i) \in I} R(\bar{x}_i)$,
where $\bar{x} := \bigcup_{R(\bar{x}_i) \in I} \bar{x}_i$. If the c-chase graph $G_c(\Sigma \cup \{\alpha_I\})$ contains no directed
path from $\alpha_I$ to $\beta \in \Sigma$, then $\beta$ is $(I, \Sigma)$-irrelevant. □

Proof Sketch. Assume that $\beta$ is not $(I, \Sigma)$-irrelevant. Then, there is a chase
sequence $I \xrightarrow{\alpha_1,\bar{a}_1} I_1 \xrightarrow{\alpha_2,\bar{a}_2} \cdots \xrightarrow{\alpha_r,\bar{a}_r} I_r \xrightarrow{\beta,\bar{a}} \cdots$.
If $\alpha_I \prec_c \beta$ we are finished. Otherwise,
there must be some $n_r \in [r]$ such that $\alpha_{n_r} \prec_c \beta$ (otherwise $\beta$ could not fire). If
$\alpha_I \prec_c \alpha_{n_r}$ we are finished. Otherwise, there must be some $n_{r-1} \in [n_r - 1]$ such
that $\alpha_{n_{r-1}} \prec_c \alpha_{n_r}$ (otherwise $\alpha_{n_r}$ could not fire). After a finite number of
iterations of this process we have that $\alpha_I \prec_c \alpha_{n_1} \prec_c \ldots \prec_c \alpha_{n_r} \prec_c \beta$. Therefore,
the chase graph contains a directed path from $\alpha_I$ to $\beta$. □

Proposition 7 together with Lemma 4 gives us a sufficient data-dependent
condition for chase termination, as illustrated in the following example.

Example 16. Consider constraint set $\Sigma$ from Figure 9 and $q_2$ from the beginning
of this section. We set

$\alpha_I := \exists c_1, x_1, x_2, y_1, y_2\; rail(c_1, x_1, y_1), fly(x_1, x_2, y_2),$
$fly(x_2, x_1, y_2), rail(x_1, c_1, y_1)$

and compute the chase graph

$G(\Sigma \cup \{\alpha_I\}) := (\Sigma \cup \{\alpha_I\}, \{(\alpha_I, \alpha_1), (\alpha_3, \alpha_3)\})$.

By Proposition 7, $\alpha_2$ and $\alpha_3$ are $(I, \Sigma)$-irrelevant. It holds that $\Sigma \setminus \{\alpha_2, \alpha_3\} =
\{\alpha_1\}$ is inductively restricted, so we know from Lemma 4 that the chase of $q_2$ with
$\Sigma$ terminates. Similar arguments hold for $q''_2$ and $q'''_2$ from the beginning of
Section 4. □
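
The static test amounts to a reachability check on the chase graph. The Python sketch below assumes the edge set of the chase graph has already been computed (deciding the underlying relation is not shown) and simply collects the constraints that cannot be reached from $\alpha_I$; the function name and the string encoding of constraints are illustrative assumptions.

```python
def irrelevant_constraints(sigma, chase_edges, alpha_I="alpha_I"):
    """Return the constraints of sigma with no directed path from alpha_I.

    By the proposition above, each returned constraint is (I, Sigma)-irrelevant.
    chase_edges is an iterable of (source, target) pairs of the chase graph.
    """
    reached, stack = set(), [alpha_I]
    while stack:
        node = stack.pop()
        for source, target in chase_edges:
            if source == node and target not in reached:
                reached.add(target)
                stack.append(target)
    return set(sigma) - reached

# Chase graph of the example: edges (alpha_I, alpha_1) and (alpha_3, alpha_3).
example_edges = {("alpha_I", "alpha_1"), ("alpha_3", "alpha_3")}
example_irrelevant = irrelevant_constraints(
    {"alpha_1", "alpha_2", "alpha_3"}, example_edges
)
```

On the example graph this yields $\alpha_2$ and $\alpha_3$, so only $\{\alpha_1\}$ needs to be tested for membership in the T-hierarchy. The test is sufficient but not complete: a constraint that is reachable in the graph may still never fire.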



4.2 Monitoring Chase Execution 

If the previous data-dependent termination condition does not apply, we propose
to monitor the chase run and abort if tuples are created that may potentially
lead to non-termination, an approach that is dynamic by nature. We introduce
a data structure called monitor graph, which allows us to track the chase run.

Definition 17. A monitor graph is a tuple $(V, E)$, where $V \subseteq \Delta_{null} \times 2^{pos(\Sigma)}$
and $E \subseteq V \times \Sigma \times 2^{pos(\Sigma)} \times V$. □

A node in a monitor graph is a tuple $(n, \pi)$, where $n$ is a null value and $\pi$ the set
of positions in which $n$ was first created (e.g. as null value with the help of some
TGD). An edge $(n_1, \pi_1, \varphi_i, \Pi, n_2, \pi_2)$ between $(n_1, \pi_1)$, $(n_2, \pi_2)$ is labeled with
the constraint $\varphi_i$ that created $n_2$ and the set of positions $\Pi$ from the body of
$\varphi_i$ in which $n_1$ occurred when $n_2$ was created. The monitor graph is successively
constructed while running the chase, according to the following definition.

Definition 18. The monitor graph $G_S := G_r$ w.r.t. $S = I_0 \xrightarrow{\varphi_0,\bar{a}_0} \ldots \xrightarrow{\varphi_{r-1},\bar{a}_{r-1}} I_r$
is a monitor graph that is inductively defined as follows.

• $G_0 = (\emptyset, \emptyset)$ is the empty monitor graph.

• If $i < r$ and $\varphi_i$ is an EGD then $G_{i+1} := G_i$.

• If $i < r$ and $\varphi_i$ is a TGD then $G_{i+1}$ is obtained from $G_i = (V_i, E_i)$ as
follows. If the chase step $I_i \xrightarrow{\varphi_i,\bar{a}_i} I_{i+1}$ does not introduce any new null values,
then $G_{i+1} := G_i$. Otherwise, $V_{i+1}$ is set as the union of $V_i$ and all pairs
$(n, \pi)$, where $n$ is a newly introduced null value and $\pi$ the set of positions in
which $n$ occurs. $E_{i+1} := E_i \cup \{ (n_1, \pi_1, \varphi_i, \Pi, n_2, \pi_2) \mid (n_1, \pi_1) \in V_i, (n_2, \pi_2) \in
V_{i+1} \setminus V_i$ and $\Pi$ is the set of positions in $body(\varphi_i(\bar{a}_i))$ where
$n_1$ occurs $\}$. □

The size of the monitor graph is polynomial in the length of the chase sequence
plus the length of the constraints' encoding. We illustrate the definition of the
monitor graph by a small example.

Example 17. Consider the constraint set $\Sigma_3 = \{\alpha_3\}$, where $\alpha_3 :=
S(x_3), R_k(x_1, x_2, x_3) \rightarrow \exists y\, R_k(y, x_1, x_2)$ from Example 15. Assume we have
an instance of the form $I_0 := \{S(a_1), S(a_2), S(a_3), R_k(a_1, a_2, a_3)\}$. Then, the
only chase sequence is $I_0 \rightarrow I_1 \rightarrow I_2 \rightarrow I_3$, where $I_1 = I_0 \cup \{R_k(y_1, a_1, a_2)\}$,
$I_2 = I_1 \cup \{R_k(y_2, y_1, a_1)\}$, $I_3 = I_2 \cup \{R_k(y_3, y_2, y_1)\}$. As $y_1$ is not in
relation $S$ the chase terminates. The monitor graph contains the path

$(y_1, \{R_k^1\}) \xrightarrow{\alpha_3, \{R_k^1\}} (y_2, \{R_k^1\}) \xrightarrow{\alpha_3, \{R_k^1\}} (y_3, \{R_k^1\})$ plus an additional edge

$(y_1, \{R_k^1\}) \xrightarrow{\alpha_3, \{R_k^2\}} (y_3, \{R_k^1\})$. □
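
The inductive construction of the monitor graph can be replayed in Python. The sketch below takes the chase steps as explicit (constraint, body atoms, head atoms) triples rather than running the chase itself, and it adds an edge only when the older null actually occurs in the body; both choices, as well as the names `positions`, `build_monitor_graph` and the use of `R` for the ternary relation, are assumptions made for this illustration.

```python
def positions(atoms, value):
    """Positions (relation, index) at which `value` occurs in `atoms`."""
    return frozenset(
        (atom[0], i)
        for atom in atoms
        for i, arg in enumerate(atom[1:], start=1)
        if arg == value
    )

def build_monitor_graph(steps, is_null):
    """steps: list of (phi, body_atoms, head_atoms) for the TGD chase steps."""
    V, E = set(), set()
    for phi, body, head in steps:
        known = {n for (n, _pi) in V}
        fresh = {a for atom in head for a in atom[1:] if is_null(a) and a not in known}
        if not fresh:
            continue                                  # no new nulls: graph unchanged
        new_nodes = {(n, positions(head, n)) for n in fresh}
        for (n1, pi1) in V:
            Pi = positions(body, n1)
            if Pi:                                    # assumption: skip nulls absent from the body
                for (n2, pi2) in new_nodes:
                    E.add((n1, pi1, phi, Pi, n2, pi2))
        V |= new_nodes
    return V, E

# The three chase steps of the example, with nulls y1, y2, y3.
EXAMPLE_STEPS = [
    ("alpha3", [("S", "a3"), ("R", "a1", "a2", "a3")], [("R", "y1", "a1", "a2")]),
    ("alpha3", [("S", "a2"), ("R", "y1", "a1", "a2")], [("R", "y2", "y1", "a1")]),
    ("alpha3", [("S", "a1"), ("R", "y2", "y1", "a1")], [("R", "y3", "y2", "y1")]),
]
V_ex, E_ex = build_monitor_graph(EXAMPLE_STEPS, is_null=lambda a: a.startswith("y"))
```

The result has three nodes and three edges: the path from y1 via y2 to y3, whose edges carry the body position set containing only the first position, and the extra edge from y1 to y3 labeled with the second position.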

Our next task is to define a necessary criterion for non-termination on top of
the monitor graph. To this end, we introduce the notion of $k$-cyclicity.

Definition 19. Let $G = (V, E)$ be a monitor graph and $k \in \mathbb{N}$. $G$ is called
$k$-cyclic if and only if there are pairwise distinct edges $v_1, \ldots, v_k \in E$ such that

• there is a path in $E$ that sequentially contains $v_1$ to $v_k$ and

• for all $i \in [k-1]$: $p_{2,3,4,6}(v_i) = p_{2,3,4,6}(v_{i+1})$. □

Example 18. Consider the scenario from Example 17. According to the previous
definition, the monitor graph presented there is 2-cyclic, but not 3-cyclic. □
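
A direct test for $k$-cyclicity can be sketched as follows. The sketch reads "sequentially contains" as: the edges occur in this order on some path, not necessarily consecutively. It also relies on the monitor graph being acyclic, which holds because edges always lead from older to newly created nulls. Both the reading and the function name `is_k_cyclic` are assumptions of this illustration. Edges are 6-tuples $(n_1, \pi_1, \varphi, \Pi, n_2, \pi_2)$ with hashable components.

```python
from functools import lru_cache

def is_k_cyclic(edges, k):
    """Check whether k distinct, equally labeled edges lie in order on one path."""
    edges = list(set(edges))
    succ = {}
    for e in edges:
        succ.setdefault((e[0], e[1]), set()).add((e[4], e[5]))

    def reaches(src, dst):
        # Depth-first search over nodes; a node reaches itself.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def label(e):
        return (e[1], e[2], e[3], e[5])      # projection onto components 2, 3, 4, 6

    @lru_cache(maxsize=None)
    def longest_chain_from(i):
        # Longest sequence of equally labeled edges starting with edges[i].
        e, best = edges[i], 1
        for j, f in enumerate(edges):
            if j != i and label(f) == label(e) and reaches((e[4], e[5]), (f[0], f[1])):
                best = max(best, 1 + longest_chain_from(j))
        return best

    return any(longest_chain_from(i) >= k for i in range(len(edges)))

# Monitor graph edges of the preceding example (relation written as R).
P1, P2 = frozenset({("R", 1)}), frozenset({("R", 2)})
EXAMPLE_EDGES = [
    ("y1", P1, "alpha3", P1, "y2", P1),
    ("y2", P1, "alpha3", P1, "y3", P1),
    ("y1", P1, "alpha3", P2, "y3", P1),
]
```

For these edges `is_k_cyclic(EXAMPLE_EDGES, 2)` holds and `is_k_cyclic(EXAMPLE_EDGES, 3)` does not, in line with Example 18: the two path edges share all four projected components, while the third edge differs in its body position set.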

We call a chase sequence $k$-cyclic if its monitor graph is $k$-cyclic. A chase sequence
may potentially be infinite if some finite prefix is $k$-cyclic, for any $k \geq 1$:

Lemma 5. Let $k \in \mathbb{N}$. If there is some infinite chase sequence $S$ when chasing
$I_0$ with $\Sigma$, then there is some finite prefix of $S$ that is $k$-cyclic. □

Proof: Assume that

- we have an infinite chase sequence $S = (I_i)_{i \in \mathbb{N}}$ and

- there is some $k \in \mathbb{N}$ such that every finite prefix of $S$ is not $k$-cyclic.

Let $(S_i)_{i \in \mathbb{N}}$ be the sequence of finite prefixes of $S$ (such that $S_i$ is a chase sequence
of length $i$) and let $(G_{S_i})_{i \in \mathbb{N}}$ be the respective sequence of monitor graphs. A path
in a monitor graph is a finite sequence of edges $e_1, \ldots, e_l$ (and not of nodes) such
that $p_{5,6}(e_i) = p_{1,2}(e_{i+1})$ for $i \in [l-1]$.

We define the notion of depth of a node in a monitor graph. Let $v$ be a node
in $G_{S_i}$ and $pred(v)$ the set of predecessors of $v$. In case $v$ has no predecessors,
the depth of $v$, $depth_{G_{S_i}}(v)$, is defined as zero. In case $v$ has predecessors, then
$depth_{G_{S_i}}(v) := 1 + \max\{ depth_{G_{S_i}}(w) \mid w \in pred(v) \}$.

The following claim follows immediately from the definition of the monitor graph. 
The formal proof is left to the reader. 

Proposition 8. Let $v$ be a node in $G_{S_i}$ and $j > i$.

• $G_{S_i}$ is an acyclic labeled tree.

• Every null value that appears in $I_i$ appears in some first position of a node
in $G_{S_i}$.

• There is a homomorphism³ $h_{ij}$ from $G_{S_i}$ to $G_{S_j}$ such that $depth_{G_{S_i}}(v) \leq
depth_{G_{S_j}}(h_{ij}(v))$.

• If $I_i \xrightarrow{\varphi_i,\bar{a}_i} I_{i+1}$, $b \in \bar{a}_i$ is a null value and $c$ is a null value that was newly
created in this step, then the depth of any node in $G_{S_{i+1}}$ in which $b$ appears
is strictly smaller than the depth of any node in $G_{S_{i+1}}$ in which $c$ appears.
(The proof is by induction on $i$.) □

The next proposition is the most important step in the proof of this lemma and
follows directly from bullet four in Proposition 8.

Proposition 9. Let $i \in \mathbb{N}$. For every $d \in \mathbb{N} \cup \{0\}$ there is a number $k_d \in \mathbb{N}$ such
that for every $i \in \mathbb{N}$ it holds that $|\{ v \mid depth_{G_{S_i}}(v) \leq d \}| \leq k_d$. Note that $k_d$ is
independent of $i$. (The proof is by induction on $d$.) □

We observe another fact. 

Proposition 10. There is some $p_k \in \mathbb{N}$ such that if some $G_{S_i}$ has a path of
length $p_k$, then $S_i$ is $k$-cyclic. □

This is because we have only a bounded number of relational symbols and
constraints available. The remaining step in the proof is to show that if we choose $i$
large enough, then $G_{S_i}$ contains a path of length $p_k$. Assume that this claim does
not hold. By Proposition 9, the number of nodes of a certain depth is bounded
(independent of $i$). So, if for every $i$ there were no path of length $p_k$ in $G_{S_i}$,
then the number of nodes in $G_{S_i}$ would be bounded (independent of $i$). This
implies that the chase has introduced only a bounded number of fresh null values,
which contradicts the assumption of an infinite chase sequence. □



3 A homomorphism leaves relational symbols and constraints untouched, i.e. is the 
identity on elements from A. 



To avoid non-termination, an application can fix a cycle-depth $k$ and stop the
chase when this limit is exceeded. For every terminating chase sequence there
is a $k$ such that the sequence is not $k$-cyclic, so if $k$ is chosen large enough the
chase will succeed. We argue that $k$-cyclicity is a natural condition that considers
situations that may cause non-termination, so this approach is preferable to
blindly chasing the instance and stopping after a fixed number of chase steps.
As justified by the following proposition, applications can choose $k$ following a
pay-as-you-go principle: for larger $k$-values the chase succeeds in more cases.

Proposition 11. For $k \in \mathbb{N}$ there is $\Sigma_k$, $I_k$ such that (a) both $\Sigma_k$ and the
subset of constraints in $\Sigma_k$ that are not $(I_k, \Sigma_k)$-irrelevant are not inductively
restricted; (b) every chase sequence for $I_k$ with $\Sigma_k$ is $(k-1)$-, but not $k$-cyclic. □

Proof: We set $\Sigma_k := \{\varphi\}$ and $I_k := \{S(c_1), \ldots, S(c_k), R_k(c_1, \ldots, c_k)\}$, where $\varphi :=
S(x_k), R_k(x_1, \ldots, x_k) \rightarrow \exists y\, R_k(y, x_1, \ldots, x_{k-1})$.

First observe that $\Sigma_k$ contains no $(I_k, \Sigma_k)$-irrelevant constraints, so the subset
of the constraints in $\Sigma_k$ that is not $(I_k, \Sigma_k)$-irrelevant equals $\Sigma_k$. It is easy to
verify that $\Sigma_k$ is not inductively restricted, although the chase with $\Sigma_k$ always
terminates, independent of the underlying data instance, so condition (a) holds.
We now chase $I_k$ with $\Sigma_k$. There is only one possible chase sequence $(J_i)_{0 \leq i \leq k}$,
defined as $J_0 := I_k$ and, for $1 \leq i \leq k$, $J_i := J_{i-1} \cup \{R_k(n_i, \ldots, n_1, c_1, \ldots, c_{k-i})\}$, where
$n_1, \ldots, n_k$ are fresh null values. It holds that $J_k \models \Sigma_k$.
The monitor graph w.r.t. $(J_i)_{0 \leq i \leq k}$ is $(V, E)$, where

$V := \{ (n_i, R_k^1) \mid i \in [k] \}$ and

$E := \{ (n_i, R_k^1, \varphi, R_k^{j-i}, n_j, R_k^1) \mid 1 \leq i < j \leq k \}$.

We observe that the sequence is $(k-1)$-cyclic because
$(n_1, R_k^1, \varphi, R_k^1, n_2, R_k^1), \ldots, (n_{k-1}, R_k^1, \varphi, R_k^1, n_k, R_k^1)$ constitute a path in the
monitor graph that satisfies the conditions of the definition of $(k-1)$-cyclicity.
The chase sequence is not $k$-cyclic because there is no path of length at least $k$
in the monitor graph. This proves part (b) of the proposition. □

5 An Application 

Answering Conjunctive Queries on knowledge bases has recently gained
attention [5,6]. Such knowledge bases typically have a set of constraints associated,
which imply additional tuples that are not materialized in the knowledge base
itself. An important problem is query answering on the implied knowledge base.
If the chase with these constraints terminates, query answering can be done by
answering the query on the chased knowledge base. However, if no termination guarantees for
the chase can be made, more sophisticated techniques for query answering are
required. This problem was first considered in [13] and then generalized in [5]
and [6]. In this section we leverage the methods developed in Section 3, showing
that they can be used to make the algorithms given in [5,6] applicable to broader
classes of constraints.



In [5] the class of so-called weakly guarded TGDs was introduced, which makes
query answering over constrained databases decidable. We first review this
notion. Later, we will generalize weakly guarded TGDs with our methods. Our
starting point is the definition of treewidth.

Definition 20. Let $\Sigma$ be a set of TGDs. We call $\Sigma$ weakly guarded
if for every $\alpha \in \Sigma$ there exists $g_\alpha \in body(\alpha)$ such
that for any $\pi \in aff(\Sigma) \cap pos(\alpha)$ and every variable $x_\pi$
that occurs in $\pi$ it holds that $x_\pi$ occurs also in $g_\alpha$. □

If $\Sigma$ is weakly guarded, we abbreviate this by $WGTGD(\Sigma)$. It was
first shown in [5] that if $WGTGD(\Sigma)$, then answering Conjunctive Queries
on $I^\Sigma$ is decidable for every database instance $I$, even though
$I^\Sigma$ may be infinite. Although not stated explicitly, it follows from
the proof of Lemma 27 in [5] that the crucial property for decidability of
query answering with WGTGDs is that in every chase step there is an atom in
the body of the constraint under consideration that contains all labeled
nulls. We state this observation more precisely in the following definition.

Definition 21. Let $S$ be a chase sequence starting with the instance $I$.
$S$ has the guarded null property if for every chase step
$I' \xrightarrow{\alpha, \bar{a}} I''$ in $S$ there is an atom in
$body(\alpha)(\bar{a})$ that contains every element from
$(\bar{a} \cap \Delta_{null}) \setminus dom(I)$ that occurs in
$head(\alpha)(\bar{a})$. □
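
To make the condition of Definition 21 concrete, here is a small sketch of
the check for a single chase step. The representation is an assumption of
ours, not taken from [5]: instantiated atoms are tuples (relation, arg1, ...),
and the set of labeled nulls and dom(I) are passed explicitly. The function
name guarded_null_step is hypothetical.

```python
def guarded_null_step(body_atoms, head_atoms, a_bar, nulls, dom_I):
    """Check the condition of Definition 21 for one chase step.

    body_atoms / head_atoms: instantiated atoms body(alpha)(a_bar) and
    head(alpha)(a_bar), each a tuple (relation, arg1, ..., argn).
    a_bar: values assigned to the universally quantified variables.
    nulls: set of all labeled nulls; dom_I: domain of the start instance I.
    """
    head_terms = {t for atom in head_atoms for t in atom[1:]}
    # Nulls of a_bar that are new w.r.t. I and are propagated into the head.
    to_guard = {t for t in a_bar
                if t in nulls and t not in dom_I and t in head_terms}
    # Some single body atom must contain all of them.
    return any(to_guard <= set(atom[1:]) for atom in body_atoms)


# Step 2 of the chase of I_3 from Proposition 11 (existential y omitted
# from the head atom because it is not part of a_bar).
ok = guarded_null_step(
    body_atoms=[("S", "c2"), ("R", "n1", "c1", "c2")],
    head_atoms=[("R", "n1", "c1")],
    a_bar=["n1", "c1", "c2"],
    nulls={"n1", "n2", "n3"},
    dom_I={"c1", "c2", "c3"},
)

# Two nulls that reach the head but never share a body atom.
bad = guarded_null_step(
    body_atoms=[("E", "n1", "c1"), ("E", "c1", "n2")],
    head_atoms=[("P", "n1", "n2")],
    a_bar=["n1", "c1", "n2"],
    nulls={"n1", "n2"},
    dom_I={"c1"},
)
```

A chase sequence has the guarded null property exactly when this check
succeeds for each of its steps.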

With this definition at hand we can generalize Lemma 27 in [5] to Lemma 6
below, which follows implicitly from the proof of Lemma 27 in [5].
To state it, we first need the notion of treewidth. A hypergraph is a pair
$\mathcal{H} = (V, H)$, where $H \subseteq 2^V$. The Gaifman graph
$G_{\mathcal{H}}$ of a hypergraph $\mathcal{H}$ has the same set of nodes as
the hypergraph and contains an edge between two nodes $v_1, v_2$ whenever
there is some $h \in H$ such that $v_1, v_2 \in h$. A tree decomposition of a
graph $\mathcal{G} = (V, E)$ is a pair $(T, B)$, where $T = (N, A)$ is a tree
and $B : N \rightarrow 2^V$ such that (i) $\bigcup_{n \in N} B(n) = V$,
(ii) for every $(v_1, v_2) \in E$ there is some $n \in N$ such that
$\{v_1, v_2\} \subseteq B(n)$, and (iii) for every $v \in V$ the set
$\{ n \in N \mid v \in B(n) \}$ is the set of nodes of a connected subtree of
$T$. The width of $(T, B)$ is $\max\{ |B(n)| - 1 \mid n \in N \}$. The
treewidth of $\mathcal{G}$, $tw(\mathcal{G})$, is the minimum width of all
tree decompositions of $\mathcal{G}$. The treewidth of a hypergraph is defined
as the treewidth of its Gaifman graph. This also defines the treewidth of a
database instance, where we consider an atom $R(a, b, c, d)$ as the set
$\{a, b, c, d\}$. We denote by $tw(I^\Sigma)$ the treewidth of $I^\Sigma$.
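
The definitions above translate directly into a checker. The sketch below is
illustrative only: it builds the Gaifman graph of an instance whose atoms are
given as sets of terms, and verifies conditions (i) to (iii) for a candidate
decomposition. It assumes that the caller supplies a tree T (this is not
verified), and all function names are our own.

```python
from itertools import combinations


def gaifman_edges(hyperedges):
    """Edges of the Gaifman graph: two nodes are adjacent iff they share a hyperedge."""
    edges = set()
    for h in hyperedges:
        edges.update(combinations(sorted(h), 2))
    return edges


def _connected(nodes, tree_edges):
    """True iff `nodes` induces a connected subgraph of the tree."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for a, b in tree_edges:
            for x, y in ((a, b), (b, a)):
                if x == n and y in nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes


def is_tree_decomposition(vertices, edges, tree_edges, bags):
    """Check conditions (i)-(iii); `bags` maps tree nodes to sets of vertices."""
    if set().union(*bags.values()) != set(vertices):                    # (i)
        return False
    if any(not any({u, v} <= b for b in bags.values()) for u, v in edges):  # (ii)
        return False
    return all(_connected({n for n, b in bags.items() if v in b}, tree_edges)
               for v in vertices)                                       # (iii)


def width(bags):
    return max(len(b) for b in bags.values()) - 1


# Instance {R(a, b, c), S(c, d)} viewed as a hypergraph.
atoms = [{"a", "b", "c"}, {"c", "d"}]
V = set().union(*atoms)
E = gaifman_edges(atoms)
bags = {0: {"a", "b", "c"}, 1: {"c", "d"}}
valid = is_tree_decomposition(V, E, [(0, 1)], bags)
```

Taking one bag per atom, as here, yields a valid decomposition of width 2 for
this instance. Computing the minimum width over all decompositions is a
separate and harder problem that the sketch does not attempt.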

Lemma 6. If all chase sequences w.r.t. $\Sigma$ and $I$ have the guarded null
property, then
$tw(I^\Sigma) \le |dom(I)| + \max\{ ar(R) \mid R \in \mathcal{R} \}$. □

We straightforwardly obtain the following theorem from a result in [7] and
the observation that, if all chase sequences have the guarded null property
and $I^\Sigma \wedge Q$ and $I^\Sigma \wedge \neg Q$ are satisfiable, then
they have models of finite treewidth (because $I^\Sigma$ has such a model).



Theorem 9. There is an algorithm that, for every set of TGDs $\Sigma$,
Conjunctive Query $q$ and database instance $I$ such that every chase sequence
has the guarded null property, correctly computes $q(I^\Sigma)$. □



Unfortunately, it is not known whether it is decidable if all chase sequences
have the guarded null property (given $\Sigma$ and $I$ as input). This
motivates the search for sufficient syntactic restrictions on the constraint
set which ensure that all chase sequences with this constraint set have the
guarded null property.
The notion of affected positions is a rough syntactic overestimation of where
labeled nulls may occur in a constraint body during the execution of the
chase. With the help of 2-restriction systems, we can improve this
overestimation. The following definition states that every TGD $\alpha$ must
have an atom in its body that contains all variables occurring in
$f(\alpha)$, where $f$ is the function from the constraint set's minimal
restriction system (cf. Definition 12). Intuitively, $f(\alpha)$ defines the
set of positions in which null values may occur during the execution of the
chase.

Definition 22. Let $\Sigma$ be a set of TGDs and $G'(\Sigma) = (G', f)$ its
minimal 2-restriction system. We call $\Sigma$ restrictedly guarded if for
every $\alpha \in \Sigma$ there exists $g_\alpha \in body(\alpha)$ such that
for any $\pi \in f$ and every universally quantified variable $x_\pi$ in
$body(\alpha)$ that occurs in $\pi$ it holds that $x_\pi$ occurs also in
$g_\alpha$. □

We call $g_\alpha$ a restricted guard and write $RGTGD(\Sigma)$ to denote that
the constraint set $\Sigma$ is restrictedly guarded.

Example 19. Consider the set of constraints
$\Sigma := \{\alpha_1, \alpha_2, \alpha_3\}$, where
$\alpha_1 := R(x_1, x_2), S(x_1, x_2) \rightarrow \exists y\, S(x_2, y)$,
$\alpha_2 := S(x_1, x_2), S(x_3, x_1) \rightarrow R(x_2, x_1)$ and
$\alpha_3 := T(x_1, x_2) \rightarrow \exists y\, S(y, x_2)$. The set of
affected positions is $aff(\Sigma) = \{S^1, S^2, R^1, R^2\}$ and therefore
$\alpha_2$ violates the condition for weak guardedness because there is no
atom that contains $x_1, x_2, x_3$. However, the constraint set is
restrictedly guarded. The minimal 2-restriction system $((\Sigma, E), f)$
contains only the edges $E(\alpha_1, \alpha_2), E(\alpha_3, \alpha_2)$ (and no
other edges) and we have that $f := \{S^2, R^1\}$. The body of $\alpha_2$
contains the atom $S(x_1, x_2)$ which serves as its restricted guard. □
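
The claims of Example 19 can be checked with a few lines of code. The sketch
below makes two assumptions that are not spelled out in this section. First,
it uses the usual inductive definition of affected positions from the
literature on weakly guarded TGDs: a position is affected if an existential
variable occurs there in some head, or if a head variable occurring there
appears in the body only at affected positions. Second, it does not compute
the minimal 2-restriction system; the set f is simply taken from the example,
read as {S^2, R^1}. All identifiers are our own.

```python
# A TGD is (body atoms, head atoms, existential variables);
# an atom is a tuple (relation, var1, ..., varn); a position is (relation, index).
a1 = ([("R", "x1", "x2"), ("S", "x1", "x2")], [("S", "x2", "y")], {"y"})
a2 = ([("S", "x1", "x2"), ("S", "x3", "x1")], [("R", "x2", "x1")], set())
a3 = ([("T", "x1", "x2")], [("S", "y", "x2")], {"y"})
sigma = [a1, a2, a3]


def positions_of(var, atoms):
    return {(a[0], i) for a in atoms for i, t in enumerate(a[1:], 1) if t == var}


def affected_positions(tgds):
    aff = set()
    for _, head, existentials in tgds:           # base case: existential variables
        for y in existentials:
            aff |= positions_of(y, head)
    changed = True
    while changed:                               # propagate until a fixpoint is reached
        changed = False
        for body, head, _ in tgds:
            for x in {t for a in body for t in a[1:]}:
                if positions_of(x, body) <= aff:
                    new = positions_of(x, head) - aff
                    if new:
                        aff |= new
                        changed = True
    return aff


def has_guard(tgd, positions):
    """Some body atom contains every body variable occurring at one of `positions`."""
    body = tgd[0]
    needed = {t for a in body for i, t in enumerate(a[1:], 1) if (a[0], i) in positions}
    return any(needed <= set(a[1:]) for a in body)


aff = affected_positions(sigma)
f = {("S", 2), ("R", 1)}                         # taken from Example 19, not computed
weakly_guarded = all(has_guard(t, aff) for t in sigma)
restrictedly_guarded = all(has_guard(t, f) for t in sigma)
```

With aff as the position set, the guard test fails for the second constraint
because it would need one atom containing x1, x2 and x3. With the smaller set
f, only x1 and x2 must be covered, which S(x1, x2) does.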

Next, we relate restricted guardedness to weak guardedness and also show the 
crucial property that restricted guardedness ensures the guarded null property. 

Lemma 7. Let $\Sigma$ be a set of TGDs.

• $WGTGD(\Sigma)$ implies $RGTGD(\Sigma)$.

• There is some $\Sigma$ s.t. $RGTGD(\Sigma)$, but not $WGTGD(\Sigma)$.

• For every database $I$ it holds that if $RGTGD(\Sigma)$, then every chase
sequence with $\Sigma$ and $I$ has the guarded null property. □

Proof Sketch. Let $(G', f)$ be the minimal 2-restriction system for $\Sigma$.
We can show by induction on the number of steps needed to compute it that
$f \subseteq aff(\Sigma)$. This implies bullet one. Bullet two is proven by
Example 19. Bullet three follows from the observation that if a constraint is
violated during the execution of the chase, say
$J \not\models \beta(\bar{a})$, then every element of
$(\bar{a} \cap \Delta_{null}) \setminus dom(I)$ appears in some position
$g_\beta^i$ of some restricted guard $g_\beta$ in the body of $\beta$. From
the construction of the minimal 2-restriction system it follows that
$g_\beta^i \in f$. □

As our final result, Lemma 7 and Theorem 9 imply:

Corollary 1. There is an algorithm that, for every $\Sigma$ with
$RGTGD(\Sigma)$, Conjunctive Query $q$ and database instance $I$, correctly
computes $q(I^\Sigma)$. □

6 Conclusions 

We studied the termination of the well-known chase algorithm. To the best of
our knowledge, this was the first study that, in addition to the constraints,
takes the specific instance (respectively query) into account. We also started
to study sufficient termination conditions for the chase which ensure,
independently of the underlying data, the termination of at least one chase
sequence and not necessarily of all. As another major contribution, we
generalized all sufficient data-independent termination conditions that were
known so far. Our results on chase termination directly carry over to
applications that rely on the chase, such as [8,13,4,12,15,2,21,11,19], and
also to the so-called core chase presented in [9]. As a sample application, we
applied our novel concepts in the context of [5], showing that they can be
used to identify a larger set of TGDs for which the methods in that paper
apply.

There are some interesting open questions left. First, it is unknown if the
membership test for $T[k]$, which has been shown to be in coNP, is also
coNP-complete. Second, it is left open if $\bigcup_{k \ge 2} T[k]$ is still
decidable. Finally, it is an interesting question if the positive results on
core computation in data exchange settings from [11] extend to the
T-hierarchy.